Re: Wide string strategies

From: Mike Gran <spk121@yahoo.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: guile-devel@gnu.org
Subject: Re: Wide string strategies
Date: Thu, 09 Apr 2009 20:39:18 -0700
Message-ID: <1239334758.7191.104.camel@localhost.localdomain>
In-Reply-To: <87prflefqk.fsf@gnu.org>

On Thu, 2009-04-09 at 22:25 +0200, Ludovic Courtès wrote: 
> Hi!

> > -  SCM_WTA_DISPATCH_1 (*SCM_SUBR_GENERIC (proc), arg1,
> > -		      SCM_ARG1, scm_i_symbol_chars (SCM_SNAME (proc)));
> > +  {
> > +    char *str = scm_to_locale_string (scm_symbol_to_string (SCM_SNAME (proc)));
> > +    SCM_WTA_DISPATCH_1 (*SCM_SUBR_GENERIC (proc), arg1, SCM_ARG1, str);
> > +    free (str);
> > +  }
> 
> This is the kind of thing we can't afford in most cases.
> 
> Here STR is only needed because `SCM_WTA_DISPATCH_1 ()' calls
> `scm_wrong_type_arg ()', which operates on C strings.
> 
> One solution would be to change `scm_wrong_type_arg ()' to operate on
> opaque strings (e.g., take an `SCM' instead of `const char *').  The
> same applies to all the functions in "error.h", and probably many
> others.
> 

Makes sense.

> I think procedures like `scm_i_string_ref_eq_char ()' are a good idea
> because it fulfills the goal of having an opaque string type *and* the
> goal of being able to handle them easily in C.

I like it, too.
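
To make the idea concrete, here is a minimal sketch of such an accessor.
The `sketch_string' type and its fields are hypothetical stand-ins, not
libguile's real stringbuf layout; the point is only that callers compare
a character without ever seeing how the string is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an opaque string object.  This is NOT
   libguile's actual representation.  */
typedef struct
{
  int is_wide;                  /* nonzero: 32-bit code points; zero: 8-bit chars */
  size_t len;                   /* number of characters */
  union
  {
    const unsigned char *narrow;
    const uint32_t *wide;
  } u;
} sketch_string;

/* Return 1 if the character at index X of S equals C, else 0.
   The caller never touches the underlying storage.  */
static int
sketch_string_ref_eq_char (const sketch_string *s, size_t x, char c)
{
  assert (x < s->len);
  uint32_t cp = s->is_wide ? s->u.wide[x] : s->u.narrow[x];
  return cp == (uint32_t) (unsigned char) c;
}
```

The same function body serves narrow and wide storage, so call sites
would not need to change if the representation does.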

> All the POSIX interface needs fast access to ASCII strings.  How about
> something like:
> 
>   const char *layout = scm_i_ascii_symbol_chars (SCM_PACK (slayout));
> 
> where `scm_i_ascii_symbol_chars ()' throws an exception if its argument
> is a non-ASCII symbol?
> 
> This would mean special-casing ASCII stringbufs so that we can treat
> them as C strings.

OK.  Fast ASCII strings for the evaluator and for POSIX should be easy
enough.  Are there any other modules that definitely require fast
strings?
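
A rough sketch of the special-casing, under the assumption that a buffer
records at creation time whether it is pure ASCII.  All names here
(`sketch_buf', `sketch_ascii_chars', and so on) are hypothetical, and
returning NULL stands in for throwing a Scheme exception:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer.  The flag is computed once, when the buffer is
   created, so later lookups are a single test.  */
typedef struct
{
  const char *chars;            /* NUL-terminated bytes */
  size_t len;
  int is_ascii;                 /* nonzero if every byte is below 0x80 */
} sketch_buf;

/* Return 1 if all LEN bytes of S are 7-bit ASCII.  */
static int
sketch_all_ascii (const char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if ((unsigned char) s[i] >= 0x80)
      return 0;
  return 1;
}

static sketch_buf
sketch_make_buf (const char *s, size_t len)
{
  sketch_buf b = { s, len, sketch_all_ascii (s, len) };
  return b;
}

/* Fast path: hand back the bytes directly for an ASCII buffer.
   NULL here stands in for raising an error on non-ASCII input.  */
static const char *
sketch_ascii_chars (const sketch_buf *b)
{
  assert (b != NULL);
  return b->is_ascii ? b->chars : NULL;
}
```

With this shape the POSIX wrappers would pay for the ASCII scan once per
buffer, not once per call.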

Also, the interaction between strings and sockets needs more thought.
If sendto and recvfrom are used for datagram transmission, as their
docstrings suggest, then locale string conversion could be a bad idea,
since it could alter the bytes of a binary payload.  (And these
functions should also operate on u8vectors, but that's another issue.)

To be more general, I know some apps depend on 8-bit strings and use
them to store non-string binary data.  I think SND falls into this
category.  I wonder if ultimately wide strings would have to be a
run-time option that is off by default.  But I am (choose your English
idiom here) getting ahead of myself, or jumping the gun, or putting the
cart before the horse.

> > +SCM_INTERNAL int scm_i_string_ref_eq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_symbol_ref_eq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_string_ref_neq_char (SCM str, size_t x, char c);
> > +SCM_INTERNAL int scm_i_symbol_ref_neq_char (SCM str, size_t x, char c);
> 
> I'd remove the `neq' variants.
> 

Sure.

> > +SCM_INTERNAL int scm_i_string_ref_eq_int (SCM str, size_t x, int c);
> 
> Does it assume sizeof (int) >= 32 ?

I suppose it does (at least 32 bits, that is).  But I only used it to
compare against the output of scm_getc, which also returns an int.
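
If we wanted to make that assumption explicit, something like the
following could work.  This is only a sketch: `sketch_codepoint_eq_int'
is a hypothetical name, and `static_assert' needs a C11 compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Fail the build if an int cannot hold every Unicode code point.
   The check is on the width in bits, not on sizeof alone.  */
static_assert (sizeof (int) * CHAR_BIT >= 32,
               "int must be at least 32 bits wide");

/* Compare a code point with an int such as the one a getc-style
   reader returns.  Negative values (EOF, for example) never match.  */
static int
sketch_codepoint_eq_int (uint32_t cp, int c)
{
  return c >= 0 && (uint32_t) c == cp;
}
```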

> 
> > +SCM_INTERNAL size_t scm_i_string_contains_char (SCM str, char ch);
> 
> Since it really returns a boolean, I'd use `int' as the return type.

Makes sense.

> 
> > +SCM_INTERNAL char *scm_i_string_to_write_sz (SCM str);
> > +SCM_INTERNAL scm_t_uint8 *scm_i_string_to_u8sz (SCM str);
> > +SCM_INTERNAL SCM scm_i_string_from_u8sz (const scm_t_uint8 *str);
> > +SCM_INTERNAL const char *scm_i_string_to_failsafe_ascii_sz (SCM str);
> > +SCM_INTERNAL const char *scm_i_symbol_to_failsafe_ascii_sz (SCM str);
> 
> What does "sz" mean?

Back in the day, "sz" was Microsoft-speak for the pointer to the first
character of a null-terminated char string.  By not knowing that, you
have demonstrated that you remain unpolluted. ;-) I probably was trying
to avoid writing "scm_i_string_to_string."

> 
> > +/* For ASCII strings, SUB can be used to represent an invalid
> > +   character.  */
> > +#define SCM_SUB ('\x1A')
> 
> Why SUB?  How about `SCM_I_SUB_CHAR', `SCM_I_INVALID_ASCII_CHAR' or
> similar?

If you're asking why SUB is set to 0x1A, the standard ECMA-48 says 0x1A
should be used to indicate an invalid ASCII character.  If you're asking
why I just called it SCM_SUB, laziness.

SCM_I_INVALID_ASCII_CHAR works for me.
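
For reference, the substitution the "failsafe" conversions rely on looks
roughly like this.  It is a sketch with hypothetical names
(`SKETCH_INVALID_ASCII_CHAR', `sketch_to_failsafe_ascii'), working on a
plain array of code points, not on a real SCM string:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* SUB (0x1A) marks a character that has no ASCII representation.  */
#define SKETCH_INVALID_ASCII_CHAR ('\x1A')

/* Return a newly allocated, NUL-terminated ASCII copy of the LEN code
   points in CP, with every non-ASCII code point replaced by SUB.
   Returns NULL if allocation fails.  The caller frees the result.  */
static char *
sketch_to_failsafe_ascii (const uint32_t *cp, size_t len)
{
  assert (cp != NULL || len == 0);

  char *out = malloc (len + 1);
  if (out == NULL)
    return NULL;

  for (size_t i = 0; i < len; i++)
    out[i] = cp[i] < 0x80 ? (char) cp[i] : SKETCH_INVALID_ASCII_CHAR;
  out[len] = '\0';
  return out;
}
```

The conversion is lossy by design: it is meant for places like error
messages, where printing something is better than failing.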

> 
> Thanks,
> Ludo'.
> 
> 
I'll try to rework this next week.

-Mike
