From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Wide string strategies
Date: Fri, 10 Apr 2009 09:57:26 +0200
Message-ID: <87r601aql5.fsf@gnu.org>
To: Mike Gran
Cc: guile-devel@gnu.org

Hi Mike,

Mike Gran writes:

> On Thu, 2009-04-09 at 22:25 +0200, Ludovic Courtès wrote:
>> All the POSIX interface needs fast access to ASCII strings.  How about
>> something like:
>>
>>   const char *layout = scm_i_ascii_symbol_chars (SCM_PACK (slayout));
>>
>> where `scm_i_ascii_symbol_chars ()' throws an exception if its argument
>> is a non-ASCII symbol?
>>
>> This would mean special-casing ASCII stringbufs so that we can treat
>> them as C strings.
>
> OK.  Fast ASCII strings for the evaluator and for POSIX should be easy
> enough.  Are there any other modules that definitely require fast
> strings?

None that I can think of.
Actually, for the file system interface, for instance, it's even
trickier: the encoding of file names usually isn't specified, but some
apps/libraries have their opinion on that, e.g., GLib
(http://library.gnome.org/devel/glib/unstable/glib-File-Utilities.html).
We should probably follow their lead here, but that's a secondary
problem anyway.

> Also, the interaction between strings and sockets needs more thought.
> If sendto and recvfrom are used for datagram transmission, as it
> suggests in their docstrings, then locale string conversion could be a
> bad idea.  (And, these functions should also operate on u8vectors, but
> that's another issue.)

Agreed.

> To be more general, I know some apps depend on 8-bit strings and use
> them as storage of non-string binary data.

Yes, notably because of `sendto' et al. that take a string.

> I think SND falls into this
> category.  I wonder if ultimately wide strings would have to be a
> run-time option that is off by default.  But I am (choose your English
> idiom here) getting ahead of myself, or jumping the gun, or putting the
> cart before the horse.

I don't have any idea of how we could usefully handle that.

Eventually, it may be a good idea to deprecate `(sendto "foobar")' in
favor of a variant that takes a bytevector or some such.

>> > +SCM_INTERNAL int scm_i_string_ref_eq_int (SCM str, size_t x, int c);
>>
>> Does it assume sizeof (int) >= 32 ?
>
> I suppose it does.  But, I only used it to compare to the output of
> scm_getc which also returns an int.

I meant, is the intent that C contains a codepoint?

>> > +SCM_INTERNAL char *scm_i_string_to_write_sz (SCM str);
>> > +SCM_INTERNAL scm_t_uint8 *scm_i_string_to_u8sz (SCM str);
>> > +SCM_INTERNAL SCM scm_i_string_from_u8sz (const scm_t_uint8 *str);
>> > +SCM_INTERNAL const char *scm_i_string_to_failsafe_ascii_sz (SCM str);
>> > +SCM_INTERNAL const char *scm_i_symbol_to_failsafe_ascii_sz (SCM str);
>>
>> What does "sz" mean?
>
> Back in the day, "sz" was Microsoft-speak for the pointer to the first
> character of a null-terminated char string.  By not knowing that, you
> have demonstrated that you remain unpolluted. ;-)  I probably was trying
> to avoid writing "scm_i_string_to_string."

Ouch, I *think* I had seen it in some places but never knew where it
comes from.  :-)

How about:

  SCM scm_i_from_ascii_string (const scm_t_uint8 *str);

and similar?

>>
>> > +/* For ASCII strings, SUB can be used to represent an invalid
>> > +   character.  */
>> > +#define SCM_SUB ('\x1A')
>>
>> Why SUB?  How about `SCM_I_SUB_CHAR', `SCM_I_INVALID_ASCII_CHAR' or
>> similar?
>
> If you're asking why SUB is set to 0x1A, the standard ECMA-48 says 0x1A
> should be used to indicate an invalid ASCII character.

I suspected that.  Then `SCM_I_SUB_CHAR' may be a good name, perhaps
with a comment saying that this is the "official SUB character".

Thanks!
Ludo'.
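
For illustration only, the "fail-safe ASCII" idea discussed above could be sketched in plain C roughly as follows. This is not code from the patch or from Guile: the names `example_is_ascii', `example_to_failsafe_ascii' and `EXAMPLE_SUB_CHAR' are hypothetical, and the sketch assumes the input is a NUL-terminated 8-bit byte string rather than a Guile stringbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The SUB (substitute) control character, 0x1A.  Hypothetical name,
   chosen for this sketch only.  */
#define EXAMPLE_SUB_CHAR ('\x1A')

/* Return 1 if every byte of the NUL-terminated string S is 7-bit
   ASCII, 0 otherwise.  A check like this is what would let a caller
   decide whether a string can be handed out as a plain C string.  */
int
example_is_ascii (const unsigned char *s)
{
  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (*s > 0x7F)
      return 0;
  return 1;
}

/* Return a newly allocated copy of S in which every non-ASCII byte
   has been replaced by SUB.  The caller must free the result.
   Return NULL if allocation fails.  */
char *
example_to_failsafe_ascii (const unsigned char *s)
{
  size_t len, i;
  char *out;

  assert (s != NULL);
  len = strlen ((const char *) s);
  out = malloc (len + 1);
  if (out == NULL)
    return NULL;

  for (i = 0; i < len; i++)
    out[i] = (s[i] > 0x7F) ? EXAMPLE_SUB_CHAR : (char) s[i];
  out[len] = '\0';
  return out;
}
```

With these definitions, a pure-ASCII input such as "layout" passes the check and is copied unchanged, while a Latin-1 byte such as 0xE9 is replaced by 0x1A in the copy.  A real implementation inside Guile would presumably work on stringbufs and signal a Scheme error rather than return NULL; that part is left out here.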