From: ludo@gnu.org (Ludovic Courtès)
To: guile-devel@gnu.org
Subject: Re: Wide strings status
Date: Tue, 21 Apr 2009 23:37:35 +0200
Message-ID: <87bpqpu1r4.fsf@gnu.org>
In-Reply-To: 1240279908.3133.76.camel@localhost.localdomain
Hello!
Mike Gran <spk121@yahoo.com> writes:
> Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
> strings or as "wide" UTF-32 strings. Strings are usually created as
> narrow strings. Narrow strings get automatically widened to wide
> strings if non-8-bit characters are set! or appended to them.
Great!
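A minimal sketch of the narrow-to-wide widening described above, assuming a hypothetical `sketch_string` type. None of these names come from libguile, and the real layout may well differ. It relies only on the fact that ISO-8859-1 byte values equal the code points U+0000 to U+00FF, so widening is a plain element-wise copy:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string object; NOT libguile's actual representation. */
typedef struct {
  size_t len;
  int is_wide;            /* 0: one byte per char (ISO-8859-1), 1: UTF-32 */
  union {
    uint8_t *narrow;
    uint32_t *wide;
  } chars;
} sketch_string;

/* Create a narrow string of LEN characters, each set to FILL.
   Returns 0 on success, -1 on allocation failure. */
static int
sketch_string_init (sketch_string *s, size_t len, uint8_t fill)
{
  s->len = len;
  s->is_wide = 0;
  s->chars.narrow = malloc (len ? len : 1);
  if (s->chars.narrow == NULL)
    return -1;
  memset (s->chars.narrow, fill, len);
  return 0;
}

/* Convert a narrow string to a wide one in place.  Each Latin-1 byte
   maps to the code point with the same numeric value. */
static int
sketch_string_widen (sketch_string *s)
{
  if (s->is_wide)
    return 0;
  uint32_t *wide = malloc ((s->len ? s->len : 1) * sizeof *wide);
  if (wide == NULL)
    return -1;
  for (size_t i = 0; i < s->len; i++)
    wide[i] = s->chars.narrow[i];
  free (s->chars.narrow);
  s->chars.wide = wide;
  s->is_wide = 1;
  return 0;
}

/* Store code point C at index I, widening first if C does not fit in
   8 bits.  Returns 0 on success, -1 on a bad index or failed widening. */
static int
sketch_string_set (sketch_string *s, size_t i, uint32_t c)
{
  if (i >= s->len)
    return -1;
  if (!s->is_wide && c > 0xFF && sketch_string_widen (s) != 0)
    return -1;
  if (s->is_wide)
    s->chars.wide[i] = c;
  else
    s->chars.narrow[i] = (uint8_t) c;
  return 0;
}

/* Read the code point at index I; the caller must ensure I < len. */
static uint32_t
sketch_string_ref (const sketch_string *s, size_t i)
{
  return s->is_wide ? s->chars.wide[i] : s->chars.narrow[i];
}

static void
sketch_string_free (sketch_string *s)
{
  if (s->is_wide)
    free (s->chars.wide);
  else
    free (s->chars.narrow);
}
```

In this sketch, storing 0xC8 leaves the string narrow, while storing 0x12C triggers a one-time conversion of the whole buffer. That conversion is the cost the micro-benchmarks would want to measure.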
> The machine-readable "write" form of strings has been changed. Before,
> non-printable characters were given as hex escapes, for example \xFF.
> Now there are three levels of hex escape for 8, 16, and 24 bit
> characters: \xFF, \uFFFF, \UFFFFFF. This is a pretty common convention.
> But after I coded this, I noticed that R6RS has a different convention
> and I'll probably go with that.
OK. I think it's probably good to follow R6RS when it has something to
say.
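For reference, the three-level escape convention quoted above (before any switch to the R6RS syntax) could be sketched as follows. The function names are hypothetical, and two details are assumptions: printable ASCII is emitted as-is, and `"` and `\` are backslash-escaped. The lowercase hex digits follow the `write` output shown later in the quoted message:

```c
#include <stdint.h>
#include <stdio.h>

/* Write the escaped form of code point C into BUF (at most SIZE bytes,
   including the terminating NUL).  Returns what snprintf returns: the
   number of characters the full escape needs.  Hypothetical helper. */
static int
sketch_write_char (char *buf, size_t size, uint32_t c)
{
  if (c == '"' || c == '\\')          /* assumption: backslash-escape these */
    return snprintf (buf, size, "\\%c", (int) c);
  if (c >= 0x20 && c <= 0x7E)         /* printable ASCII: literal */
    return snprintf (buf, size, "%c", (int) c);
  if (c <= 0xFF)                      /* 8-bit:  \xFF */
    return snprintf (buf, size, "\\x%02x", (unsigned) c);
  if (c <= 0xFFFF)                    /* 16-bit: \uFFFF */
    return snprintf (buf, size, "\\u%04x", (unsigned) c);
  return snprintf (buf, size, "\\U%06x", (unsigned) c);  /* 24-bit: \UFFFFFF */
}

/* Write the quoted, escaped form of LEN code points into OUT.
   Returns 0 on success, -1 if OUT is too small. */
static int
sketch_write_string (char *out, size_t size, const uint32_t *chars, size_t len)
{
  size_t used = 0;
  if (size < 3)
    return -1;
  out[used++] = '"';
  for (size_t i = 0; i < len; i++)
    {
      /* Keep one byte in reserve for the closing quote. */
      size_t avail = size - used - 1;
      int n = sketch_write_char (out + used, avail, chars[i]);
      if (n < 0 || (size_t) n >= avail)
        return -1;
      used += (size_t) n;
    }
  out[used++] = '"';
  out[used] = '\0';
  return 0;
}
```

Fed the code points 100, 200, 300, 400 and 500, this yields `"d\xc8\u012c\u0190\u01f4"`, matching the `write` output quoted below.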
> The internal representation of strings seems to work already, but the
> reader doesn't work yet. For now, one can make wide strings like this:
>
>> (setlocale LC_ALL "")
> ==> "en_US.UTF-8"
>
>> (define str (apply string (map integer->char '(100 200 300 400 500))))
>
>> (write str)
> ==>"d\xc8\u012c\u0190\u01f4"
>
>> (display str)
> ==>dÈĬƐǴ
Eh eh, looks nice. Looking forward to typing `(λ (x y) (+ x y))'. ;-)
> This is all going to be slower than before because of the string
> conversion operations, but I didn't want to do any premature
> optimization. I wanted to get it working first; there is plenty
> of room for optimization later.
Good. Maybe it'd be nice to add simple micro-benchmarks for
`string-ref', `string-set!' et al. under `benchmarks'.
> Character encoding needs to be a property of ports, so that not all
> string operations are done in the current locale. This is necessary so
> that UTF-8-encoded source files are not interpreted differently based on
> the current locale.
You seem to imply that `scm_getc ()' will now return a Unicode
codepoint, is that right? What about `scm_c_{read,write} ()', and
`scm_{get,put}s ()'?
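To make the question concrete: if a port carries an encoding, a character-level read has to consume a variable number of bytes and hand back one code point. Below is a rough sketch for the UTF-8 case. `sketch_port` and `sketch_getc_utf8` are hypothetical stand-ins, not the libguile port structure or the actual behavior of `scm_getc ()`:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte source standing in for a port's read buffer. */
typedef struct {
  const uint8_t *bytes;
  size_t len;
  size_t pos;
} sketch_port;

#define SKETCH_EOF   (-1)
#define SKETCH_ERROR (-2)

/* Decode the next UTF-8 sequence from P and return its code point,
   SKETCH_EOF at end of input, or SKETCH_ERROR on malformed input
   (bad lead byte, truncated or overlong sequence, surrogate, or a
   value above U+10FFFF). */
static int32_t
sketch_getc_utf8 (sketch_port *p)
{
  if (p->pos >= p->len)
    return SKETCH_EOF;

  uint8_t b0 = p->bytes[p->pos++];
  int extra;          /* number of continuation bytes expected */
  uint32_t c, min;    /* MIN: smallest value legal for this length */

  if (b0 < 0x80)
    return b0;
  else if ((b0 & 0xE0) == 0xC0)
    { extra = 1; c = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0)
    { extra = 2; c = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0)
    { extra = 3; c = b0 & 0x07; min = 0x10000; }
  else
    return SKETCH_ERROR;

  for (int i = 0; i < extra; i++)
    {
      if (p->pos >= p->len)
        return SKETCH_ERROR;
      uint8_t b = p->bytes[p->pos];
      if ((b & 0xC0) != 0x80)
        return SKETCH_ERROR;
      p->pos++;
      c = (c << 6) | (b & 0x3F);
    }

  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return SKETCH_ERROR;
  return (int32_t) c;
}
```

Byte-oriented calls such as `scm_c_{read,write} ()` have no obvious counterpart in this sketch, which is why their semantics need to be spelled out separately.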
> The VM and interpreter need to be updated to deal with wide chars and
> probably in other ways that are unclear to me now. Wide strings are
> currently getting truncated to 8-bit somewhere in there.
The compiler could use bytevectors when dealing with bytecode. Maybe
that would clarify things.
Thanks,
Ludo'.