From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Andy Wingo <wingo@pobox.com>
Newsgroups: gmane.lisp.guile.devel
Subject: string_abstraction2 review
Date: Thu, 11 Jun 2009 23:12:51 +0200
Message-ID: <m3my8e1nl8.fsf@pobox.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1244754781 15782 80.91.229.12 (11 Jun 2009 21:13:01 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 11 Jun 2009 21:13:01 +0000 (UTC)
Cc: guile-devel <guile-devel@gnu.org>
To: Mike Gran <spk121@yahoo.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Jun 11 23:12:58 2009
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1MErZv-0005yf-AC
	for guile-devel@m.gmane.org; Thu, 11 Jun 2009 23:12:56 +0200
Original-Received: from localhost ([127.0.0.1]:54841 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1MErZu-0000Ft-LI
	for guile-devel@m.gmane.org; Thu, 11 Jun 2009 17:12:54 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1MErZr-0000Fk-LB
	for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:51 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1MErZn-0000FQ-CU
	for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:51 -0400
Original-Received: from [199.232.76.173] (port=55633 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1MErZn-0000FN-8Y
	for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:47 -0400
Original-Received: from mx20.gnu.org ([199.232.41.8]:57667)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from <wingo@pobox.com>) id 1MErZm-0006vT-MT
	for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:46 -0400
Original-Received: from a-sasl-fastnet.sasl.smtp.pobox.com ([207.106.133.19]
	helo=sasl.smtp.pobox.com) by mx20.gnu.org with esmtp (Exim 4.60)
	(envelope-from <wingo@pobox.com>) id 1MErZk-0002Kv-CM
	for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:44 -0400
Original-Received: from localhost.localdomain (unknown [127.0.0.1])
	by a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTP id
	230EDBA5CC; Thu, 11 Jun 2009 17:12:42 -0400 (EDT)
Original-Received: from unquote (unknown [81.38.182.145]) (using TLSv1 with cipher
	DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by
	a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTPSA id 88512BA5CA;
	Thu, 11 Jun 2009 17:12:39 -0400 (EDT)
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.92 (gnu/linux)
X-Pobox-Relay-ID: 99C9317C-56CC-11DE-8E90-97731A10BFE7-02397024!a-sasl-fastnet.pobox.com
X-Detected-Operating-System: by mx20.gnu.org: Solaris 10 (beta)
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6,
	seldom 2.4 (older, 4)
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:8666
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/8666>

Howdy good sir!

I've been taking a look at string_abstraction2 this evening. (Well, I
wrote that last evening. Time passes!) I'm really looking forward to it
going in. The concepts seem sound, but I think the branch could be even
better with a rebase. Further comments and impressions:

    commit 6d1c2613b799c86ac183a2f520c789f0afb8cf60
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 20:07:53 2009 -0700

        Gnulib updates to support wide characters

We can probably drop this one entirely, no? [It imports the libunistring
gnulib module, but we depend on libunistring being installed instead.]

    commit ffc654adb020bd8bd1b108e7c7047c8b7da410a3
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 20:33:53 2009 -0700

        Use 32 bit characters

This one looks great. My only comment is that it would be better to keep
SCM_MAKE_CHAR as a pure macro.

    commit dfbcbe20c7d010bc9f0a64f40b0aaae635915405
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 21:03:00 2009 -0700

        String abstraction -- don't unpack symbol characters in eval.

Looks good, though to my eyes there's no need for the "String
abstraction -- " prefix -- in the end these commits will stand alone in
the git history. Please ignore if it's against other input you have
    received, or against your sensibilities :)

commit 53464169b1799036ddee2fcf07b0dd92e3fc8d4b
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 22:36:04 2009 -0700

        String abstraction -- use string operators in hash

Looks good also. Only question is why scm_i_string_ref_to_int32 -- why
not scm_t_wchar, and just call it scm_c_string_ref ?

    commit d311521f6e280aca9a2687624a5b36b38f60cfc7
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 22:55:21 2009 -0700

        String abstraction -- use string operators for filesys

Looks good -- though again, scm_i_string_ref_eq_char doesn't sound right
    as a name. scm_i_string_has_char_at, maybe? Dunno.commit 07ed5be21432fc=
b2b092d372067a002f6662aac5
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 23:24:44 2009 -0700

        string abstraction -- use symbol operators in garbage collector

Same notes for scm_i_symbol_ref_eq_char, and I don't think
scm_i_symbol_ref_to_char should exist -- the cost of that struct vtable
layout cruftiness should probably be borne by struct.c.

    commit 554c99e7c348ce7d02f2337fa2f04d666c6de892
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 17 23:34:46 2009 -0700

        string abstraction -- avoid unpacking strings when defining goops c=
lasses

Looks fine, though some logic is duplicated. Still it's OK.

Say, you're using tabs in some of these commits -- please don't :-)
(setq indent-tabs-mode nil) if you use emacs.

    commit 6443bf5de7dd727cf5f5ededc016e6c349518da4
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 06:35:08 2009 -0700

        string abstraction -- avoid unpacking port mode strings

Excelente

    commit 3570ed124988dbcc3f2085c5ef501cd124b945a7
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 06:42:39 2009 -0700

        string abstraction -- avoid unpacking mode mknod strings

Cool

    commit 12348d6335c2cb9da08badc707dc1bd069844bb1
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 06:47:43 2009 -0700

        string abstraction -- don't unpack random state string

    The remember_upto_here needs to go just before the return.

    commit 322197c8e2e5aac004290a075d5391b88a66a7f7
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 06:57:38 2009 -0700

        string abstraction -- avoid unpacking delimiter strings

Do we really need the eq_char version and the eq_wchar version? Can't we
just have the latter, and char casts to wchar transparently?

    commit 4c57c26923c0b6ae5027c76dc32e5251a2f6cbcf
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 07:23:10 2009 -0700

        string abstraction -- avoid unpacking strings in send/recv

Hm. It seems it would be better for send/recv to take bytevectors, but
to accept strings for back compatibility. It's a somewhat orthogonal
question to your patch, but now that we have bytevectors, this is
exactly what they're for. That way we can deprecate passing strings to
these functions. What do you think?

    commit a462ad13abac7f12fa370fb1724a6eb7d80edcb2
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 07:50:21 2009 -0700

        string abstraction on charsets and combine redundant string_ref_to_=
XXX

I still thing scm_i_string_ref should just return a wchar :) Regarding
character sets: I understand that Will Clinger has an efficient
implementation of character sets in Scheme that works well with the
sparse nature of Unicode codepoints. Perhaps that could be of use.
Dunno.

    commit 1091a38a8599f41c551686845fa4559f67e898a1
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 08:27:29 2009 -0700

        Convert strings to utf8 to pass thru posix time funcs

Hmm, this one seems dangerous to me. Would it not be possible for the
UTF-8 serialization of a string to contain a valid % escape sequence?
Looking over the UTF-8 conversion algorithm, I guess not, as only 0x00
to 0x80 have their high bit unset... OK :-)

The only thing is that scm_strptime seems to leak memory if the strptime
failed.

    commit c28492c576da916b93748b3fd4f069eb7b3e5a12
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 18:42:06 2009 -0700

        string abstraction -- don't unpack strings in struct

Fine, but I still think you can do e.g. foo =3D=3D 'R' even if foo is a
wchar; which obviates the need for those _char functions, and the SCM
temporary variables.

    commit 07c5b853955972aeca744ca197e8169f739f5427
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 18 19:01:37 2009 -0700

        string abstraction -- avoid unpacking strings in goops and unif

Looks good

    commit 6106178ffea886eaa10c7b931cd0657e8b871fc2
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 19 19:46:53 2009 -0700

        string abstraction -- use string functions to access string interna=
ls

This is a truly heroic commit.

Only minor nitpicks: scm_i_string_set needs an _x suffix. REF_EQ_CHAR
could probably be scm_c_string_has_char_at. Tabs should be spaces.
Otherwise, wow.

    commit 5abb7ba75778c36620d381b157eea451d6498c0f
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 19 20:02:58 2009 -0700

        string abstraction -- avoid unpacking symbol characters

I worry that lookup_interned_symbol could be too slow. Maybe a
scm_i_symbol_has_name that could be implemented in strings.c and just
result in a strcmp ?

I was about to complain about scm_take_locale_symboln maybe losing some
efficiency, but it doesn't seem we use it at all, so that's probably OK.

    commit fdc2be0ae9a022a7ff5f1f83db36e821eac895a3
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 19 22:29:43 2009 -0700

        string abstraction -- don't unpack strings when converting strings =
to numbers

Another great commit.

    commit 7ba0bb8eff42f110479a687446a8b8831fa75f4a
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 19 23:30:31 2009 -0700

        Add funcs to lfwrite strings to ports without unpacking them

Looks good

    commit c692ee76f274469ec48e86064b22ae819a498748
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 06:48:21 2009 -0700

        string abstraction -- avoid unpacking strings when unreading strings

cool

    commit 60cc9d00e4ba1f5fbe8dbe3254e0c9d1cc09fb9c
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 06:51:16 2009 -0700

        string abstraction -- avoid unpacking strings when creating obarrays

I don't care about obarrays ;-)

    commit 4a17d5e6fca109bb9943666664fe49e3879c957f
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 07:05:39 2009 -0700

        string abstraction -- avoid unpacking guile strings

Many of these strings could leak if the locale does not validate, though
this point may be moot if/when we merge BDW-GC.  Suggest passing
an already-validated scm_t_locale to compare_strings.

I had no idea these functions existed ;)

    commit c16c6626a522d1c3f32e79ed57103b7676d54e51
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 07:20:24 2009 -0700

        symbol abstraction -- avoid unpacking symbol when making keywords

OK
    commit c9dc2528cfdc97351f5f9182e5b9b66ba2d2c21f
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 07:50:01 2009 -0700

        symbol abstraction -- avoid unpacking symbols when printing them

Looks good. Why does scm_i_print_symbol_name need to be in the header,
though?

    commit 03455b48fe28003b8cf1333ba593224d8fe34e2e
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Wed May 20 21:30:19 2009 -0700

        string abstraction -- avoid unpacking strings in gentemp

OK

    commit b507262d75db26b6034328396b0be8837f2d1437
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Thu May 21 19:06:23 2009 -0700

        Use Latin1 & UTF-32 as internal string encoding

When you rebase, I would do whatever you need to do to Gnulib first, and
not with other Guile changes. Makes the diffs a little easier to read :)
Does our use of libunistring not obviate the need for strconveh et al?

setbinary is a nasty interface, though it may indeed be necessary. At
the least, it should return the previous port encoding.

make_wide_stringbuf should check that the size_t len is within range so
as to prevent integer overflows. Same concern in widen_stringbuf. (This
kind of thing is a common source of security vulnerabilities, as you
probably know :)

scm_i_is_narrow_string should be scm_i_string_is_narrow, as it does
assume the arg is a string. Same with scm_i_is_narrow_symbol.

There is some funny indentation in scm_i_string_chars. I like the
approach taken there, though. Check indentation too at the end of
scm_i_string_writable_chars.

%string-dump has different spacing when dumping narrow vs wide chars.
Check the format strings.

scm_string should declare the return of scm_ilength as an ssize_t
instead of a long. There shouldn't be any errors, and if there are, that
could indicate other problems.

I fear scm_c_make_string could be much slower, given that it's going to
be locking and unlocking a mutex on each character set. Can you work
around this?

scm_from/to_locale_string look *great*.

i18n.test and ports.test need to have their setlocale / setbinary
invocations inside an (eval-when (eval load compile) ...) block, so the
expander does the right thing with them. They should reset the
locale/encoding to their previous values at the end of the tests, as the
tests are all run together in one process.

    But yes, another heroic commit.commit 84df3e211d1eaab8b0735a6f1f27f56e4=
f579875
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 24 17:15:10 2009 -0700

        add tests for encoding/decoding wide strings

Cool. Also, wow:

+(with-test-prefix
+ "non-ascii variable names"
+
+ (pass-if "1"
+	  (let ((=E8=8A=A5=E5=B7=9D=E9=BE=8D=E4=B9=8B=E4=BB=8B  1)
+		(=C3=B1 2))
+	    (eq? (+  =E8=8A=A5=E5=B7=9D=E9=BE=8D=E4=B9=8B=E4=BB=8B =C3=B1) 3))))

    commit b2d407915c35e5a45d65be9c6c15abcda16b8c87
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 11:40:46 2009 -0700

        Read and write wide strings.  Allow wide symbols.   Interpret wide =
code.

I understand that making scm_getc inline was quite important for
performance. Is it possible to fast-path the ascii case so scm_getc is
inlined, and only calls out-of-line for the slow case? Is it /really/
the case that ascii alnums represent themselves in all encodings except
UTF-16 and UTF-32 ? And doesn't your fastpath not take the UTF-16/32
cases into account?

I do not envy you for mucking about in ports.c, btw ;)

You note that normally setencoding is called for you by setlocale; but
why would you call it on its own? Also u32 is leaked if the encoding
isn't valid, in scm_setencoding.

read_token is likely much slower than it was now, calling out-of-line to
lock the mutex, etc. Would be nice to inline the string set, in
scm_read_sexp as well.

read_complete_token is a nice simplification.

Can you really assume that newline is newline in all encodings?

The string port business is a bit nasty -- of course, that nastiness
proceeds from the port API itself. But would it not be possible to base
string ports on top of bytevectors, instead of using strings as a poor
man's bytevector? Bytevectors are better than u8vectors because they can
be viewed as other types, and parsed as unicode.

Also, with-output-to-locale-u8vector is a nasty name ;) It probably
doesn't belong in r4rs.scm, which is loaded very early in the boot
process. Our boot files need some re-arranging...

    commit 823e444052817ee120d87a3575acb4f767f17475
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Sun May 24 17:15:10 2009 -0700

        add tests for encoding/decoding wide strings

The tests won't work with the compiler, for lack of eval-when, but we
    can fix that later.commit de1bb55692fc9fc35748ba24da36ee3086068053
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:12:27 2009 -0700

        revert scm_getc to be SCM_API

    scm_getc should probably be the inlined one, as I mentioned before.

    commit a3b9555bd2b58c62f0a1c8431f6d91703e43dbf8
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:17:37 2009 -0700

        workaround to make guile-readline buile

I didn't see why you needed to define _UNUSED_PARAMETER_ -- do our
public headers now require it? (That would be an error.)

    commit 1ffb9fb19ccec5e96d762e073a9bbabbc41962e4
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:19:37 2009 -0700

        vm string loader should use internal encoding of strings

I think instead of this, we should have
load-{latin1,utf32}-{string,symbol,keyword}. Of course that requires
some work to the compiler as well as the VM.

    commit 7994186ddd09f140a00ef04891c966714a5f28e2
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:21:17 2009 -0700

        declare encoding of test source files

Better to merge the scan-for-encoding patches first, no?

    commit f20946c01b53b8022e0044d1ada32e52faf3c577
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:33:17 2009 -0700

        fix broken tests: check locale encoding using u8 vectors

Ah, I see now why you have with-output-to-locale-u8vector. How very,
very unfortunate.

While I haven't gotten to your ports-have-their-own-encodings patches,
would it not be possible to have string ports' encoding be UTF-32? Then
you could simply convert to the locale, and you'd be fine after that.
What do you think?

    commit 9bfb3f8ab6e6d1a134c7562f2bc25392c032653e
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:37:24 2009 -0700

        rename encoding_*.test to encoding-*.test

    commit b123b24a4695d42fa797cd6902f8866eac1659d5
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon May 25 15:39:52 2009 -0700

        add new tests to Makefile.am

Please squash these two. Also, you have spaces in your Makefile changes
where we have tabs ;) Please check test-suite/Makefile.am to be sure.

    commit cd130d9b5fbb6345a57dfde0800be8a9a9257203
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 26 05:52:42 2009 -0700

        Cosmetic changes. Indenting.

Cool

    commit 1a4e40f36a3a22b0eb507502c54ece487d472056
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Tue May 26 08:56:06 2009 -0700

        move reader encoding parameters to fluids

As a matter of style, I would stash away the fluid as your global var,
not the variable -- because it's the fluid which provides the rebinding
facility. If the function is called before it is initialized, just make
sure the program crashes -- that's a Guile internal error.

So scm_i_get_conversion_strategy () should be simply scm_fluid_ref
(conversion_strategy), then returning the appropriate ilseq handler.

IMO, of course.

Otherwise looks fine.

    commit 39efc84f1569365c05ee98efac9093bbde78c3d7
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Thu May 28 21:52:44 2009 -0700

        use scheme file encoding declared in the comment header.

Wow, this is really wild. You scan bytes and strcmp as ASCII, but then
hope the rest of the buffer is compatible with the original encoding?
How would one encode a file as UTF-16? Do people just not do that?

*after looking* Ah, I get it now, people only encode UTF-16/32 with
byte-order marks. We should catch those in this function, it seems we
only catch the UTF-8 BOM.

    commit 9d52ea2fa4073582113f4235dededb002682e459
    Author: Michael Gran <spk121@yahoo.com>
    Date:   Mon Jun 1 07:47:04 2009 -0700

        Revert the changes of 726b1624a9e3dd1251f18916f2dd0409ffb71e54

Then you revert it here -- but that is not the right SHA1 AFAICT,
perhaps it referred to some other branch. These two commits should be
removed.

OK, I'm getting quite tired now, I imagine you are too ;-) This is an
excellent branch, but I think with some effort it could be even more
excellent. I'm going to be occupied for the next few days, so I won't
get back to reviewing the tail of your branch -- though I'm sure you'll
have enough on your hands. Have a lovely holiday, and let's get this
into 1.9.2!

Peace,

Andy
--=20
http://wingolog.org/