From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Andy Wingo Newsgroups: gmane.lisp.guile.devel Subject: string_abstraction2 review Date: Thu, 11 Jun 2009 23:12:51 +0200 Message-ID: NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: ger.gmane.org 1244754781 15782 80.91.229.12 (11 Jun 2009 21:13:01 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 11 Jun 2009 21:13:01 +0000 (UTC) Cc: guile-devel To: Mike Gran Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Jun 11 23:12:58 2009 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1MErZv-0005yf-AC for guile-devel@m.gmane.org; Thu, 11 Jun 2009 23:12:56 +0200 Original-Received: from localhost ([127.0.0.1]:54841 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1MErZu-0000Ft-LI for guile-devel@m.gmane.org; Thu, 11 Jun 2009 17:12:54 -0400 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1MErZr-0000Fk-LB for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:51 -0400 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1MErZn-0000FQ-CU for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:51 -0400 Original-Received: from [199.232.76.173] (port=55633 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1MErZn-0000FN-8Y for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:47 -0400 Original-Received: from mx20.gnu.org ([199.232.41.8]:57667) by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.60) (envelope-from ) id 1MErZm-0006vT-MT for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:46 -0400 Original-Received: from a-sasl-fastnet.sasl.smtp.pobox.com ([207.106.133.19] helo=sasl.smtp.pobox.com) by mx20.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1MErZk-0002Kv-CM for guile-devel@gnu.org; Thu, 11 Jun 2009 17:12:44 -0400 Original-Received: from localhost.localdomain (unknown [127.0.0.1]) by a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTP id 230EDBA5CC; Thu, 11 Jun 2009 17:12:42 -0400 (EDT) Original-Received: from unquote (unknown [81.38.182.145]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by a-sasl-fastnet.sasl.smtp.pobox.com (Postfix) with ESMTPSA id 88512BA5CA; Thu, 11 Jun 2009 17:12:39 -0400 (EDT) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.92 (gnu/linux) X-Pobox-Relay-ID: 99C9317C-56CC-11DE-8E90-97731A10BFE7-02397024!a-sasl-fastnet.pobox.com X-Detected-Operating-System: by mx20.gnu.org: Solaris 10 (beta) X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6, seldom 2.4 (older, 4) X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:8666 Archived-At: Howdy good sir! I've been taking a look at string_abstraction2 this evening. (Well, I wrote that last evening. Time passes!) I'm really looking forward to it going in. The concepts seem sound, but I think the branch could be even better with a rebase. Further comments and impressions: commit 6d1c2613b799c86ac183a2f520c789f0afb8cf60 Author: Michael Gran Date: Sun May 17 20:07:53 2009 -0700 Gnulib updates to support wide characters We can probably drop this one entirely, no? [It imports the libunistring gnulib module, but we depend on libunistring being installed instead.] commit ffc654adb020bd8bd1b108e7c7047c8b7da410a3 Author: Michael Gran Date: Sun May 17 20:33:53 2009 -0700 Use 32 bit characters This one looks great. My only comment is that it would be better to keep SCM_MAKE_CHAR as a pure macro. commit dfbcbe20c7d010bc9f0a64f40b0aaae635915405 Author: Michael Gran Date: Sun May 17 21:03:00 2009 -0700 String abstraction -- don't unpack symbol characters in eval. Looks good, though to my eyes there's no need for the "String abstraction -- " prefix -- in the end these commits will stand alone in the git history. Please ignore if it's against other input you have received, or against your sensibilities :) commit 53464169b1799036ddee2fcf07b0dd92e3fc8d4b Author: Michael Gran Date: Sun May 17 22:36:04 2009 -0700 String abstraction -- use string operators in hash Looks good also. Only question is why scm_i_string_ref_to_int32 -- why not scm_t_wchar, and just call it scm_c_string_ref ? commit d311521f6e280aca9a2687624a5b36b38f60cfc7 Author: Michael Gran Date: Sun May 17 22:55:21 2009 -0700 String abstraction -- use string operators for filesys Looks good -- though again, scm_i_string_ref_eq_char doesn't sound right as a name. scm_i_string_has_char_at, maybe? Dunno.commit 07ed5be21432fc= b2b092d372067a002f6662aac5 Author: Michael Gran Date: Sun May 17 23:24:44 2009 -0700 string abstraction -- use symbol operators in garbage collector Same notes for scm_i_symbol_ref_eq_char, and I don't think scm_i_symbol_ref_to_char should exist -- the cost of that struct vtable layout cruftiness should probably be borne by struct.c. commit 554c99e7c348ce7d02f2337fa2f04d666c6de892 Author: Michael Gran Date: Sun May 17 23:34:46 2009 -0700 string abstraction -- avoid unpacking strings when defining goops c= lasses Looks fine, though some logic is duplicated. Still it's OK. Say, you're using tabs in some of these commits -- please don't :-) (setq indent-tabs-mode nil) if you use emacs. commit 6443bf5de7dd727cf5f5ededc016e6c349518da4 Author: Michael Gran Date: Mon May 18 06:35:08 2009 -0700 string abstraction -- avoid unpacking port mode strings Excelente commit 3570ed124988dbcc3f2085c5ef501cd124b945a7 Author: Michael Gran Date: Mon May 18 06:42:39 2009 -0700 string abstraction -- avoid unpacking mode mknod strings Cool commit 12348d6335c2cb9da08badc707dc1bd069844bb1 Author: Michael Gran Date: Mon May 18 06:47:43 2009 -0700 string abstraction -- don't unpack random state string The remember_upto_here needs to go just before the return. commit 322197c8e2e5aac004290a075d5391b88a66a7f7 Author: Michael Gran Date: Mon May 18 06:57:38 2009 -0700 string abstraction -- avoid unpacking delimiter strings Do we really need the eq_char version and the eq_wchar version? Can't we just have the latter, and char casts to wchar transparently? commit 4c57c26923c0b6ae5027c76dc32e5251a2f6cbcf Author: Michael Gran Date: Mon May 18 07:23:10 2009 -0700 string abstraction -- avoid unpacking strings in send/recv Hm. It seems it would be better for send/recv to take bytevectors, but to accept strings for back compatibility. It's a somewhat orthogonal question to your patch, but now that we have bytevectors, this is exactly what they're for. That way we can deprecate passing strings to these functions. What do you think? commit a462ad13abac7f12fa370fb1724a6eb7d80edcb2 Author: Michael Gran Date: Mon May 18 07:50:21 2009 -0700 string abstraction on charsets and combine redundant string_ref_to_= XXX I still thing scm_i_string_ref should just return a wchar :) Regarding character sets: I understand that Will Clinger has an efficient implementation of character sets in Scheme that works well with the sparse nature of Unicode codepoints. Perhaps that could be of use. Dunno. commit 1091a38a8599f41c551686845fa4559f67e898a1 Author: Michael Gran Date: Mon May 18 08:27:29 2009 -0700 Convert strings to utf8 to pass thru posix time funcs Hmm, this one seems dangerous to me. Would it not be possible for the UTF-8 serialization of a string to contain a valid % escape sequence? Looking over the UTF-8 conversion algorithm, I guess not, as only 0x00 to 0x80 have their high bit unset... OK :-) The only thing is that scm_strptime seems to leak memory if the strptime failed. commit c28492c576da916b93748b3fd4f069eb7b3e5a12 Author: Michael Gran Date: Mon May 18 18:42:06 2009 -0700 string abstraction -- don't unpack strings in struct Fine, but I still think you can do e.g. foo =3D=3D 'R' even if foo is a wchar; which obviates the need for those _char functions, and the SCM temporary variables. commit 07c5b853955972aeca744ca197e8169f739f5427 Author: Michael Gran Date: Mon May 18 19:01:37 2009 -0700 string abstraction -- avoid unpacking strings in goops and unif Looks good commit 6106178ffea886eaa10c7b931cd0657e8b871fc2 Author: Michael Gran Date: Tue May 19 19:46:53 2009 -0700 string abstraction -- use string functions to access string interna= ls This is a truly heroic commit. Only minor nitpicks: scm_i_string_set needs an _x suffix. REF_EQ_CHAR could probably be scm_c_string_has_char_at. Tabs should be spaces. Otherwise, wow. commit 5abb7ba75778c36620d381b157eea451d6498c0f Author: Michael Gran Date: Tue May 19 20:02:58 2009 -0700 string abstraction -- avoid unpacking symbol characters I worry that lookup_interned_symbol could be too slow. Maybe a scm_i_symbol_has_name that could be implemented in strings.c and just result in a strcmp ? I was about to complain about scm_take_locale_symboln maybe losing some efficiency, but it doesn't seem we use it at all, so that's probably OK. commit fdc2be0ae9a022a7ff5f1f83db36e821eac895a3 Author: Michael Gran Date: Tue May 19 22:29:43 2009 -0700 string abstraction -- don't unpack strings when converting strings = to numbers Another great commit. commit 7ba0bb8eff42f110479a687446a8b8831fa75f4a Author: Michael Gran Date: Tue May 19 23:30:31 2009 -0700 Add funcs to lfwrite strings to ports without unpacking them Looks good commit c692ee76f274469ec48e86064b22ae819a498748 Author: Michael Gran Date: Wed May 20 06:48:21 2009 -0700 string abstraction -- avoid unpacking strings when unreading strings cool commit 60cc9d00e4ba1f5fbe8dbe3254e0c9d1cc09fb9c Author: Michael Gran Date: Wed May 20 06:51:16 2009 -0700 string abstraction -- avoid unpacking strings when creating obarrays I don't care about obarrays ;-) commit 4a17d5e6fca109bb9943666664fe49e3879c957f Author: Michael Gran Date: Wed May 20 07:05:39 2009 -0700 string abstraction -- avoid unpacking guile strings Many of these strings could leak if the locale does not validate, though this point may be moot if/when we merge BDW-GC. Suggest passing an already-validated scm_t_locale to compare_strings. I had no idea these functions existed ;) commit c16c6626a522d1c3f32e79ed57103b7676d54e51 Author: Michael Gran Date: Wed May 20 07:20:24 2009 -0700 symbol abstraction -- avoid unpacking symbol when making keywords OK commit c9dc2528cfdc97351f5f9182e5b9b66ba2d2c21f Author: Michael Gran Date: Wed May 20 07:50:01 2009 -0700 symbol abstraction -- avoid unpacking symbols when printing them Looks good. Why does scm_i_print_symbol_name need to be in the header, though? commit 03455b48fe28003b8cf1333ba593224d8fe34e2e Author: Michael Gran Date: Wed May 20 21:30:19 2009 -0700 string abstraction -- avoid unpacking strings in gentemp OK commit b507262d75db26b6034328396b0be8837f2d1437 Author: Michael Gran Date: Thu May 21 19:06:23 2009 -0700 Use Latin1 & UTF-32 as internal string encoding When you rebase, I would do whatever you need to do to Gnulib first, and not with other Guile changes. Makes the diffs a little easier to read :) Does our use of libunistring not obviate the need for strconveh et al? setbinary is a nasty interface, though it may indeed be necessary. At the least, it should return the previous port encoding. make_wide_stringbuf should check that the size_t len is within range so as to prevent integer overflows. Same concern in widen_stringbuf. (This kind of thing is a common source of security vulnerabilities, as you probably know :) scm_i_is_narrow_string should be scm_i_string_is_narrow, as it does assume the arg is a string. Same with scm_i_is_narrow_symbol. There is some funny indentation in scm_i_string_chars. I like the approach taken there, though. Check indentation too at the end of scm_i_string_writable_chars. %string-dump has different spacing when dumping narrow vs wide chars. Check the format strings. scm_string should declare the return of scm_ilength as an ssize_t instead of a long. There shouldn't be any errors, and if there are, that could indicate other problems. I fear scm_c_make_string could be much slower, given that it's going to be locking and unlocking a mutex on each character set. Can you work around this? scm_from/to_locale_string look *great*. i18n.test and ports.test need to have their setlocale / setbinary invocations inside an (eval-when (eval load compile) ...) block, so the expander does the right thing with them. They should reset the locale/encoding to their previous values at the end of the tests, as the tests are all run together in one process. But yes, another heroic commit.commit 84df3e211d1eaab8b0735a6f1f27f56e4= f579875 Author: Michael Gran Date: Sun May 24 17:15:10 2009 -0700 add tests for encoding/decoding wide strings Cool. Also, wow: +(with-test-prefix + "non-ascii variable names" + + (pass-if "1" + (let ((=E8=8A=A5=E5=B7=9D=E9=BE=8D=E4=B9=8B=E4=BB=8B 1) + (=C3=B1 2)) + (eq? (+ =E8=8A=A5=E5=B7=9D=E9=BE=8D=E4=B9=8B=E4=BB=8B =C3=B1) 3)))) commit b2d407915c35e5a45d65be9c6c15abcda16b8c87 Author: Michael Gran Date: Mon May 25 11:40:46 2009 -0700 Read and write wide strings. Allow wide symbols. Interpret wide = code. I understand that making scm_getc inline was quite important for performance. Is it possible to fast-path the ascii case so scm_getc is inlined, and only calls out-of-line for the slow case? Is it /really/ the case that ascii alnums represent themselves in all encodings except UTF-16 and UTF-32 ? And doesn't your fastpath not take the UTF-16/32 cases into account? I do not envy you for mucking about in ports.c, btw ;) You note that normally setencoding is called for you by setlocale; but why would you call it on its own? Also u32 is leaked if the encoding isn't valid, in scm_setencoding. read_token is likely much slower than it was now, calling out-of-line to lock the mutex, etc. Would be nice to inline the string set, in scm_read_sexp as well. read_complete_token is a nice simplification. Can you really assume that newline is newline in all encodings? The string port business is a bit nasty -- of course, that nastiness proceeds from the port API itself. But would it not be possible to base string ports on top of bytevectors, instead of using strings as a poor man's bytevector? Bytevectors are better than u8vectors because they can be viewed as other types, and parsed as unicode. Also, with-output-to-locale-u8vector is a nasty name ;) It probably doesn't belong in r4rs.scm, which is loaded very early in the boot process. Our boot files need some re-arranging... commit 823e444052817ee120d87a3575acb4f767f17475 Author: Michael Gran Date: Sun May 24 17:15:10 2009 -0700 add tests for encoding/decoding wide strings The tests won't work with the compiler, for lack of eval-when, but we can fix that later.commit de1bb55692fc9fc35748ba24da36ee3086068053 Author: Michael Gran Date: Mon May 25 15:12:27 2009 -0700 revert scm_getc to be SCM_API scm_getc should probably be the inlined one, as I mentioned before. commit a3b9555bd2b58c62f0a1c8431f6d91703e43dbf8 Author: Michael Gran Date: Mon May 25 15:17:37 2009 -0700 workaround to make guile-readline buile I didn't see why you needed to define _UNUSED_PARAMETER_ -- do our public headers now require it? (That would be an error.) commit 1ffb9fb19ccec5e96d762e073a9bbabbc41962e4 Author: Michael Gran Date: Mon May 25 15:19:37 2009 -0700 vm string loader should use internal encoding of strings I think instead of this, we should have load-{latin1,utf32}-{string,symbol,keyword}. Of course that requires some work to the compiler as well as the VM. commit 7994186ddd09f140a00ef04891c966714a5f28e2 Author: Michael Gran Date: Mon May 25 15:21:17 2009 -0700 declare encoding of test source files Better to merge the scan-for-encoding patches first, no? commit f20946c01b53b8022e0044d1ada32e52faf3c577 Author: Michael Gran Date: Mon May 25 15:33:17 2009 -0700 fix broken tests: check locale encoding using u8 vectors Ah, I see now why you have with-output-to-locale-u8vector. How very, very unfortunate. While I haven't gotten to your ports-have-their-own-encodings patches, would it not be possible to have string ports' encoding be UTF-32? Then you could simply convert to the locale, and you'd be fine after that. What do you think? commit 9bfb3f8ab6e6d1a134c7562f2bc25392c032653e Author: Michael Gran Date: Mon May 25 15:37:24 2009 -0700 rename encoding_*.test to encoding-*.test commit b123b24a4695d42fa797cd6902f8866eac1659d5 Author: Michael Gran Date: Mon May 25 15:39:52 2009 -0700 add new tests to Makefile.am Please squash these two. Also, you have spaces in your Makefile changes where we have tabs ;) Please check test-suite/Makefile.am to be sure. commit cd130d9b5fbb6345a57dfde0800be8a9a9257203 Author: Michael Gran Date: Tue May 26 05:52:42 2009 -0700 Cosmetic changes. Indenting. Cool commit 1a4e40f36a3a22b0eb507502c54ece487d472056 Author: Michael Gran Date: Tue May 26 08:56:06 2009 -0700 move reader encoding parameters to fluids As a matter of style, I would stash away the fluid as your global var, not the variable -- because it's the fluid which provides the rebinding facility. If the function is called before it is initialized, just make sure the program crashes -- that's a Guile internal error. So scm_i_get_conversion_strategy () should be simply scm_fluid_ref (conversion_strategy), then returning the appropriate ilseq handler. IMO, of course. Otherwise looks fine. commit 39efc84f1569365c05ee98efac9093bbde78c3d7 Author: Michael Gran Date: Thu May 28 21:52:44 2009 -0700 use scheme file encoding declared in the comment header. Wow, this is really wild. You scan bytes and strcmp as ASCII, but then hope the rest of the buffer is compatible with the original encoding? How would one encode a file as UTF-16? Do people just not do that? *after looking* Ah, I get it now, people only encode UTF-16/32 with byte-order marks. We should catch those in this function, it seems we only catch the UTF-8 BOM. commit 9d52ea2fa4073582113f4235dededb002682e459 Author: Michael Gran Date: Mon Jun 1 07:47:04 2009 -0700 Revert the changes of 726b1624a9e3dd1251f18916f2dd0409ffb71e54 Then you revert it here -- but that is not the right SHA1 AFAICT, perhaps it referred to some other branch. These two commits should be removed. OK, I'm getting quite tired now, I imagine you are too ;-) This is an excellent branch, but I think with some effort it could be even more excellent. I'm going to be occupied for the next few days, so I won't get back to reviewing the tail of your branch -- though I'm sure you'll have enough on your hands. Have a lovely holiday, and let's get this into 1.9.2! Peace, Andy --=20 http://wingolog.org/