From: Mike Gran <spk121@yahoo.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: guile-devel@gnu.org
Subject: Re: Wide string strategies
Date: Fri, 10 Apr 2009 10:14:00 -0700 (PDT)
Message-ID: <316910.55438.qm@web37903.mail.mud.yahoo.com>
In-Reply-To: <87r601aql5.fsf@gnu.org>
> From: Ludovic Courtès <ludo@gnu.org>
> Mike Gran writes:
> > On Thu, 2009-04-09 at 22:25 +0200, Ludovic Courtès wrote:
> Actually, for the file system interface, for instance, it's even
> trickier: the encoding of file names usually isn't specified, but some
> apps/libraries have their opinion on that, e.g., Glib
> (http://library.gnome.org/devel/glib/unstable/glib-File-Utilities.html).
> We should probably follow their lead here, but that's a secondary
> problem anyway.
True. The one real standard that I do know is that NTFS stores
filenames as UTF-16.
>
> > Also, the interaction between strings and sockets needs more thought.
> > If sendto and recvfrom are used for datagram transmission, as it
> > suggests in their docstrings, then locale string conversion could be a
> > bad idea. (And, these functions should also operate on u8vectors, but
> > that's another issue.)
>
> Agreed.
>
> > To be more general, I know some apps depend on 8-bit strings and use
> > them as storage of non-string binary data.
>
> Yes, notably because of `sendto' et al. that take a string.
>
> > I think SND falls into this
> > category. I wonder if ultimately wide strings would have to be a
> > run-time option that is off by default. But I am (choose your English
> > idiom here) getting ahead of myself, or jumping the gun, or putting the
> > cart before the horse.
>
> I don't have any idea of how we could usefully handle that.
>
> Eventually, it may be a good idea to deprecate `(sendto "foobar")' in
> favor of a variant that takes a bytevector or some such.
Maybe it's best to leave them unchanged w.r.t. strings. Any char values
between 128 and 255 would simply be interpreted as UCS-4 code points
128 to 255 and stored in the strings directly.
In the short term, socket functions could also be modified
to take both strings and u8vectors. Then, if someone were actually
pushing UTF-8 strings over the network, they could use
"utf8-encoded-u8vector->string" or some such to do the conversion.
And, in the long run, sockets can become a type of port, and those
ports can have attached transcoding.
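
To make the difference between the two treatments concrete, here is a
minimal sketch in plain C. It assumes a wide string is just an array of
32-bit code points; the names bytes_to_ucs4_direct and utf8_to_ucs4 are
hypothetical and are not part of Guile or of any patch in this thread.
The first function is the "put the bytes in directly" behavior, the
second is roughly what an explicit UTF-8 conversion would have to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Guile function): treat each byte as the
   code point with the same value, so 128..255 become U+0080..U+00FF.
   DST must have room for LEN code points.  */
void
bytes_to_ucs4_direct (const uint8_t *src, size_t len, uint32_t *dst)
{
  size_t i;
  for (i = 0; i < len; i++)
    dst[i] = src[i];
}

/* Hypothetical helper: decode UTF-8 from SRC into DST, which must have
   room for LEN code points.  Returns the number of code points written,
   or (size_t) -1 on malformed input.  */
size_t
utf8_to_ucs4 (const uint8_t *src, size_t len, uint32_t *dst)
{
  size_t i = 0, n = 0;

  while (i < len)
    {
      uint8_t b = src[i++];
      uint32_t c, min;
      int extra;

      if (b < 0x80)
        { c = b; extra = 0; min = 0; }
      else if ((b & 0xE0) == 0xC0)
        { c = b & 0x1F; extra = 1; min = 0x80; }
      else if ((b & 0xF0) == 0xE0)
        { c = b & 0x0F; extra = 2; min = 0x800; }
      else if ((b & 0xF8) == 0xF0)
        { c = b & 0x07; extra = 3; min = 0x10000; }
      else
        return (size_t) -1;   /* stray continuation or invalid lead byte */

      while (extra-- > 0)
        {
          if (i >= len || (src[i] & 0xC0) != 0x80)
            return (size_t) -1;   /* truncated or bad continuation byte */
          c = (c << 6) | (src[i++] & 0x3F);
        }

      /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
      if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return (size_t) -1;

      dst[n++] = c;
    }
  return n;
}
```

Given the bytes 0x41 0xC3 0xA9, the direct version yields three code
points (0x41, 0xC3, 0xA9), while the UTF-8 version yields two (0x41,
0xE9). Binary data survives the direct version unchanged, which is the
point of leaving the socket functions alone.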
>
> >> > +SCM_INTERNAL int scm_i_string_ref_eq_int (SCM str, size_t x, int c);
> >>
> >> Does it assume sizeof (int) >= 32 ?
> >
> > I suppose it does. But, I only used it to compare to the output of
> > scm_getc which also returns an int.
>
> I meant, is the intent that C contains a codepoint?
Yes. And when wide strings are implemented, the gnulib convention is
that a wide character is represented in C as a uint32_t.
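
As a rough illustration of the comparison, here is a sketch in plain C.
It assumes the string is an array of uint32_t code points and that C is
either a code point or a negative end-of-file value such as the one
scm_getc returns; wide_ref_eq_int is a hypothetical name, not the
actual scm_i_string_ref_eq_int. The preprocessor check makes the
assumption about the width of int explicit, since C itself only
guarantees 16 bits.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* An int must be able to hold every code point up to U+10FFFF and
   still have room for a negative EOF value.  */
#if INT_MAX < 0x10FFFF
# error "int is too narrow to hold a Unicode code point"
#endif

/* Hypothetical sketch: return 1 if the code point at index X of CHARS
   (LEN code points long) equals C, and 0 otherwise.  A negative C,
   such as EOF, never matches.  */
int
wide_ref_eq_int (const uint32_t *chars, size_t len, size_t x, int c)
{
  if (x >= len || c < 0)
    return 0;
  return chars[x] == (uint32_t) c;
}
```

Checking for a negative C before the cast matters: converting EOF to
uint32_t would give 0xFFFFFFFF, which is not a valid code point but is
better rejected explicitly than left to chance.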
>
> >> > +SCM_INTERNAL char *scm_i_string_to_write_sz (SCM str);
> >> > +SCM_INTERNAL scm_t_uint8 *scm_i_string_to_u8sz (SCM str);
> >> > +SCM_INTERNAL SCM scm_i_string_from_u8sz (const scm_t_uint8 *str);
> >> > +SCM_INTERNAL const char *scm_i_string_to_failsafe_ascii_sz (SCM str);
> >> > +SCM_INTERNAL const char *scm_i_symbol_to_failsafe_ascii_sz (SCM str);
> How about:
>
> SCM scm_i_from_ascii_string (const scm_t_uint8 *str);
>
> and similar?
OK.
> >>
> >> > +/* For ASCII strings, SUB can be used to represent an invalid
> >> > + character. */
> >> > +#define SCM_SUB ('\x1A')
> >>
> >> Why SUB? How about `SCM_I_SUB_CHAR', `SCM_I_INVALID_ASCII_CHAR' or
> >> similar?
> >
> > If you're asking why SUB is set to 0x1A, the standard ECMA-48 says 0x1A
> > should be used to indicate an invalid ASCII character.
>
> I suspected that. Then `SCM_I_SUB_CHAR' may be a good name, perhaps
> with a comment saying that this is the "official SUB character".
>
OK.
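
For reference, the "failsafe ASCII" idea could look something like the
sketch below. It is plain C over an array of uint32_t code points, not
the real scm_i_string_to_failsafe_ascii_sz; the function name and the
SUB_CHAR macro are placeholders chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* ECMA-48 SUBSTITUTE control character (placeholder macro name).  */
#define SUB_CHAR '\x1A'

/* Hypothetical sketch: write an ASCII rendering of the LEN code points
   in SRC into DST, which must have room for LEN + 1 bytes.  Any code
   point outside 0..127 is replaced by SUB.  The result is
   NUL-terminated; a NUL code point inside SRC is copied as-is and
   would cut the C string short.  */
void
ucs4_to_failsafe_ascii (const uint32_t *src, size_t len, char *dst)
{
  size_t i;
  for (i = 0; i < len; i++)
    dst[i] = (src[i] < 0x80) ? (char) src[i] : SUB_CHAR;
  dst[len] = '\0';
}
```

The conversion cannot fail and always produces exactly LEN bytes plus
the terminator, which is what makes it usable in error-reporting paths.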
Thanks,
Mike