I'm looking for a method of converting a string's character encoding

* I'm looking for a method of converting a string's character encoding
@ 2012-04-27 21:13 Sunjoong Lee
  2012-04-28  1:40 ` Sunjoong Lee
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Sunjoong Lee @ 2012-04-27 21:13 UTC (permalink / raw)
  To: guile-user

Hello,

I'm looking for a method of converting a string's character encoding from a
certain codeset to utf-8. I know that Guile strings use utf-8 and that (read
(open-bytevector-input-port (string->utf8 "hello"))) reads hello back. But
what if the text "hello" is encoded in something other than utf-8 and you
want a utf-8 converted string? What I want is something like iconv.

Background:
The #:decode-body? keyword of http-get seems not to work properly; I have to
set #:decode-body? to a false value and decode the body manually. If a web
page's charset is utf-8, there is no problem. If not, a problem occurs:
decode-response-body of (web client) calls decode-string with the web page's
charset, but the real charset of the bytevector is iso-8859-1, not the web
page's charset. In that case you should not let http-get use
decode-response-body.

After getting the response body as a bytevector, you should decode it as
"iso-8859-1", the way decode-string does. Then you get the web page's body
as a string whose charset is the one you see in the response header.

Now I need to convert this body string to utf-8, but I don't know how. I
think it would be done with port I/O.

Thanks.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-27 21:13 I'm looking for a method of converting a string's character encoding Sunjoong Lee
@ 2012-04-28  1:40 ` Sunjoong Lee
  2012-04-28 16:38 ` Sunjoong Lee
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Sunjoong Lee @ 2012-04-28  1:40 UTC (permalink / raw)
  To: guile-user

Are file ports and string ports very different? I can convert strings in a
file, but I don't want to use a file. My terminal charset is utf-8. Suppose
there is an "XXX"-encoded text file "a.txt"; it can be converted like this:

(use-modules (ice-9 rdelim))
(set-port-encoding! (current-output-port) "utf-8")
(define port (open-input-file "a.txt"))
(set-port-encoding! port "XXX")
(display (read-delimited "" port))
(close-port port)

I tried a similar approach with a string port but failed. In the real case
there is an "XXX"-encoded string. Here I cannot construct one directly, so I
read it from a file:

(use-modules (ice-9 rdelim))
(define port (open-input-file "a.txt"))
(set-port-encoding! port "XXX")
(let ((port1 (open-input-string (let ((str (read-delimited "" port)))
                                  (close-input-port port)
                                  str)))
      (port2 (open-output-string)))
  (set-port-encoding! port1 "XXX")
  (set-port-encoding! port2 "utf-8")
  (display (read-delimited "" port1) port2)
  (close-input-port port1)
  (display (get-output-string port2))
  (close-output-port port2))
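
For what it's worth, the same conversion can be sketched without a file, on
the assumption that the "XXX"-encoded bytes are already in a bytevector (for
example a body fetched with #:decode-body? set to #f). The sketch opens a
bytevector input port, sets its encoding, and reads the text out.
decode-bytes is a made-up helper name, EUC-KR merely stands in for "XXX", and
whether a given encoding name is accepted depends on the iconv available on
your system.

```scheme
(use-modules (rnrs bytevectors)   ; u8-list->bytevector
             (rnrs io ports))     ; open-bytevector-input-port, get-string-all

;; Hypothetical helper: decode the bytes in BV, interpreting them as ENCODING.
(define (decode-bytes bv encoding)
  (let ((port (open-bytevector-input-port bv)))
    (set-port-encoding! port encoding)
    (let ((str (get-string-all port)))
      (close-port port)
      str)))

;; The two bytes #xC7 #xD1 are the EUC-KR encoding of U+D55C.
(define euc-kr-bytes (u8-list->bytevector '(#xC7 #xD1)))
(define decoded (decode-bytes euc-kr-bytes "EUC-KR"))
```

Once decoded, the result is an ordinary Guile string; displaying it on a port
whose encoding is "utf-8" writes it out as UTF-8.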

2012/4/28 Sunjoong Lee <sunjoong@gmail.com>
>
> Now I need to convert this body string to utf-8, but I don't know how. I
> think it would be done with port I/O.
>

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-27 21:13 I'm looking for a method of converting a string's character encoding Sunjoong Lee
  2012-04-28  1:40 ` Sunjoong Lee
@ 2012-04-28 16:38 ` Sunjoong Lee
  2012-04-28 17:33   ` Thien-Thi Nguyen
  2012-05-02  3:57 ` Daniel Hartwig
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Sunjoong Lee @ 2012-04-28 16:38 UTC (permalink / raw)
  To: guile-user

http-get is innocent, but I still need an encoding converter.

After adding the line (set-port-encoding! (current-output-port) "utf-8") at
the start of my program, the body string of the web page displayed well.
With with-fluids and %default-port-encoding, I can use html->sxml. But the
contents of the output sxml are in the original web page's codeset. For
example, when you want to compare strings, you must use the web page's
codeset. If you want to compare strings from two web pages, a codeset
conversion method may be needed.

2012/4/28 Sunjoong Lee <sunjoong@gmail.com>
>
> Background:
> The #:decode-body? keyword of http-get seems not to work properly; I have to
> set #:decode-body? to a false value and decode the body manually. If a web
> page's charset is utf-8, there is no problem. If not, a problem occurs:
> decode-response-body of (web client) calls decode-string with the web page's
> charset, but the real charset of the bytevector is iso-8859-1, not the web
> page's charset. In that case you should not let http-get use
> decode-response-body.
>

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 16:38 ` Sunjoong Lee
@ 2012-04-28 17:33   ` Thien-Thi Nguyen
  2012-04-28 18:29     ` Daniel Krueger
  0 siblings, 1 reply; 15+ messages in thread
From: Thien-Thi Nguyen @ 2012-04-28 17:33 UTC (permalink / raw)
  To: Sunjoong Lee; +Cc: guile-user

() Sunjoong Lee <sunjoong@gmail.com>
() Sun, 29 Apr 2012 01:38:28 +0900

   http-get is innocent, but I still need an encoding converter.

It sounds like a good exercise (that would flush out bugs and
raise confidence in the infrastructure) would be to implement
an iconv-workalike program in Scheme.  Maybe one already exists?

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 17:33   ` Thien-Thi Nguyen
@ 2012-04-28 18:29     ` Daniel Krueger
  2012-04-28 19:54       ` Thien-Thi Nguyen
  2012-04-28 20:55       ` Eli Zaretskii
  0 siblings, 2 replies; 15+ messages in thread
From: Daniel Krueger @ 2012-04-28 18:29 UTC (permalink / raw)
  To: Thien-Thi Nguyen; +Cc: guile-user, Sunjoong Lee

Hi,

i think there shouldn't be any transcoding of guile's strings, as
strings are internal representation of characters, no matter how they
are encoded. So the only time when encoding matters is when it passes
its `internal boundaries', i mean if you write the string to a port or
read from a port or pass it as a string to a foreign library. For the
ports all transcoding is available, and as said, the real
representation of guile strings internally is as utf8, which can't be
changed. The only additional thing i forgot about are bytevectors, if
you convert a string to an explicit representation, but afaik there
you also can give the encoding to use.
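
To illustrate the point that a string is a sequence of characters and that
bytes only appear once an encoding is chosen, here is a small sketch using
only (rnrs bytevectors). The variable names are made up for the
illustration.

```scheme
(use-modules (rnrs bytevectors))  ; string->utf8, string->utf16, string->utf32

;; A one-character string: U+00E9, LATIN SMALL LETTER E WITH ACUTE.
(define s (string (integer->char #xE9)))

;; The string itself is counted in characters ...
(define n-chars (string-length s))

;; ... while the byte count depends on the encoding chosen at the boundary.
(define n-utf8  (bytevector-length (string->utf8 s)))
(define n-utf16 (bytevector-length (string->utf16 s)))
(define n-utf32 (bytevector-length (string->utf32 s)))
```

The same one-character string occupies 2, 2 and 4 bytes respectively in the
three encodings.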

Am I wrong?

- Daniel

On Sat, Apr 28, 2012 at 7:33 PM, Thien-Thi Nguyen <ttn@gnuvola.org> wrote:
> () Sunjoong Lee <sunjoong@gmail.com>
> () Sun, 29 Apr 2012 01:38:28 +0900
>
>   http-get is innocent, but I still need an encoding converter.
>
> It sounds like a good exercise (that would flush out bugs and
> raise confidence in the infrastructure) would be to implement
> an iconv-workalike program in Scheme.  Maybe one already exists?
>

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 18:29     ` Daniel Krueger
@ 2012-04-28 19:54       ` Thien-Thi Nguyen
  2012-04-28 20:55       ` Eli Zaretskii
  1 sibling, 0 replies; 15+ messages in thread
From: Thien-Thi Nguyen @ 2012-04-28 19:54 UTC (permalink / raw)
  To: Daniel Krueger; +Cc: guile-user, Sunjoong Lee

() Daniel Krueger <keenbug@googlemail.com>
() Sat, 28 Apr 2012 20:29:22 +0200

   i think there shouldn't be any transcoding of guile's strings,
   as strings are internal representation of characters, no matter
   how they are encoded. So the only time when encoding matters is
   when it passes its `internal boundaries', i mean if you write
   the string to a port or read from a port or pass it as a string
   to a foreign library.

Indeed, iconv(1) converts external representations (files).  How
it does that internally is an implementation detail.  That's the
main reason why i suggested it as a model for exercising Guile's
internals -- it's very easy to check correctness.

   For the ports all transcoding is available, and as said, the
   real representation of guile strings internally is as utf8,
   which can't be changed.

IIUC, the internal representation of strings is not UTF-8 (at
least, not all the time), but anyway, that doesn't matter at all.
The proposed task is to use procedures and features provided by
Guile (i.e., its public API) to mimic iconv.

   The only additional thing i forgot about are bytevectors, if
   you convert a string to an explicit representation, but afaik
   there you also can give the encoding to use.

   Am I wrong?

I don't know.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 18:29     ` Daniel Krueger
  2012-04-28 19:54       ` Thien-Thi Nguyen
@ 2012-04-28 20:55       ` Eli Zaretskii
  2012-04-28 22:42         ` Sunjoong Lee
                           ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Eli Zaretskii @ 2012-04-28 20:55 UTC (permalink / raw)
  To: Daniel Krueger; +Cc: guile-user, ttn, sunjoong

> Date: Sat, 28 Apr 2012 20:29:22 +0200
> From: Daniel Krueger <keenbug@googlemail.com>
> Cc: guile-user@gnu.org, Sunjoong Lee <sunjoong@gmail.com>
> 
> i think there shouldn't be any transcoding of guile's strings, as
> strings are internal representation of characters, no matter how they
> are encoded. So the only time when encoding matters is when it passes
> its `internal boundaries', i mean if you write the string to a port or
> read from a port or pass it as a string to a foreign library. For the
> ports all transcoding is available, and as said, the real
> representation of guile strings internally is as utf8, which can't be
> changed. The only additional thing i forgot about are bytevectors, if
> you convert a string to an explicit representation, but afaik there
> you also can give the encoding to use.
> 
> Am I wrong?

You are mostly right, but only "mostly".  Experience teaches that
sometimes you need to change encoding even inside "the boundaries".
One notable example is when the original encoding was determined
incorrectly, and the application wants to "re-decode" the string, when
its external origin is no longer available.  Another example is an
application that wants to convert an encoded string into base-64 (or
similar) form -- you'll need to encode the string internally first.

These kinds of rare, but still important, use cases are the reason why
Emacs Lisp has primitives to do encoding and decoding of in-memory
strings; as much as Emacs maintainers want to get rid of the related
need to support "unibyte strings", they are not going to go away any
time soon.

IOW, Guile needs a way to represent a string encoded in something
other than UTF-8, and convert between UTF-8 and other encodings.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 20:55       ` Eli Zaretskii
@ 2012-04-28 22:42         ` Sunjoong Lee
  2012-04-29  0:25         ` Sunjoong Lee
  2012-04-30 10:18         ` Daniel Krueger
  2 siblings, 0 replies; 15+ messages in thread
From: Sunjoong Lee @ 2012-04-28 22:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, Daniel Krueger, ttn

Thanks Thien-Thi, Daniel and Eli.

Eli pointed out a good example; I'll give another one. In countries whose
characters are encoded as multiple bytes, like China, Japan and Korea (i.e.,
CJK), converting codesets is a common issue. In Korea, one web page may be
written in EUC-KR and another in UTF-8. In Japan: Shift-JIS, EUC-JP,
ISO-2022-JP and UTF-8. In China: GBK, gb18030, Big5, Big5-HKSCS and UTF-8. I
mean that Koreans use 2 different codesets on the net, Japanese 4, Chinese 5.

Comparing a Chinese web page with a Korean one in the same program probably
doesn't happen often, but suppose you want to write a program that monitors
web pages: a codeset converter would be needed. Just in CJK? Greeks use 3
codesets, Vietnamese 2, Arabs 3, and so on. It looks like Russians use as
many codesets as the Chinese.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 20:55       ` Eli Zaretskii
  2012-04-28 22:42         ` Sunjoong Lee
@ 2012-04-29  0:25         ` Sunjoong Lee
  2012-04-30 10:18         ` Daniel Krueger
  2 siblings, 0 replies; 15+ messages in thread
From: Sunjoong Lee @ 2012-04-29  0:25 UTC (permalink / raw)
  To: Daniel Krueger; +Cc: guile-user

Supporting only UTF-8 still seems strange, but now I understand why Daniel
said so. After adding these two lines, most of my problems with http-get
were solved:

(set-port-encoding! (current-output-port) "UTF-8")
(fluid-set! %default-port-encoding "UTF-8")

This is like magic!! I think it would be good to add this information to the
Guile manual. My first problem was that the body of the web page did not
display. The second was that calling html->sxml of guile-lib did not work.
After reading htmlprag.scm, I realized that html->sxml calls
htmlprag-internal:parse-html, and htmlprag-internal:parse-html uses a string
port. I remembered this sentence: "When string ports are created, they do
not inherit a character encoding from the current locale." Most people would
not know how a utility like html->sxml is implemented, or that you need to
use fluid-set! .
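
A scoped variant of the fluid-set! line, as a sketch: with-fluids (mentioned
earlier in this thread) changes %default-port-encoding only while a thunk
runs, instead of changing it for the rest of the program.
call-with-default-port-encoding is a hypothetical helper name, not something
provided by Guile or guile-lib.

```scheme
;; Hypothetical helper: call THUNK with %default-port-encoding bound to
;; ENCODING, restoring the previous value when THUNK returns.
(define (call-with-default-port-encoding encoding thunk)
  (with-fluids ((%default-port-encoding encoding))
    (thunk)))

;; Observe the fluid before, during and after the scoped call.  In a real
;; program the thunk would be something like (lambda () (html->sxml body)).
(define before (fluid-ref %default-port-encoding))
(define inside
  (call-with-default-port-encoding "UTF-8"
    (lambda () (fluid-ref %default-port-encoding))))
(define after (fluid-ref %default-port-encoding))
```

The value seen inside the thunk is "UTF-8", while the value outside is left
as it was.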

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-28 20:55       ` Eli Zaretskii
  2012-04-28 22:42         ` Sunjoong Lee
  2012-04-29  0:25         ` Sunjoong Lee
@ 2012-04-30 10:18         ` Daniel Krueger
  2012-04-30 12:21           ` Eli Zaretskii
  2012-05-03 22:34           ` Ludovic Courtès
  2 siblings, 2 replies; 15+ messages in thread
From: Daniel Krueger @ 2012-04-30 10:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: guile-user, ttn, sunjoong

On Sat, Apr 28, 2012 at 10:55 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> One notable example is when the original encoding was determined
> incorrectly, and the application wants to "re-decode" the string, when
> its external origin is no longer available.

Okay, but then I would suggest either if you know you're probably not
getting the right encoding but can determine it later to only store
the input as a bytevector and later decode it correctly. Or if you
already have the string you could encode it back to a bytevector with
the wrong guessed encoding (which should emit the original input I
think) and then re-decode it with the right encoding. Wouldn't that be
the same solution as adding a primitive which does the same thing but
on some lower level?
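
The second option can be sketched as follows, assuming a Guile that provides
the (ice-9 iconv) module with string->bytevector and bytevector->string; that
module is not mentioned anywhere in this thread and, as far as I know, only
appeared in later releases. redecode is a hypothetical helper name. The round
trip only recovers the input when the wrongly guessed encoding can represent
every character of the string, as ISO-8859-1 can here.

```scheme
(use-modules (ice-9 iconv))  ; string->bytevector, bytevector->string

;; Hypothetical helper: STR was decoded with the WRONG encoding; encode it
;; back to the original bytes and decode those with the RIGHT encoding.
(define (redecode str wrong right)
  (bytevector->string (string->bytevector str wrong) right))

;; 195 169 (#xC3 #xA9) is the UTF-8 encoding of U+00E9.
(define original-bytes #vu8(195 169))

;; Misread as ISO-8859-1, the two bytes become two characters.
(define misread (bytevector->string original-bytes "ISO-8859-1"))

;; Re-decoding recovers the single intended character.
(define repaired (redecode misread "ISO-8859-1" "UTF-8"))
```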

> Another example is an
> application that wants to convert an encoded string into base-64 (or
> similar) form -- you'll need to encode the string internally first.

Here I don't have enough experience, but wouldn't you then just again
transform the string into a bytevector and further work with it?

> IOW, Guile needs a way to represent a string encoded in something
> other than UTF-8, and convert between UTF-8 and other encodings.

I think strings should be encoding `independent', so you don't have to
mind that if you don't need to, and if you're working with a special
encoding you're working on a representation of the `text' as a number
of characters encoded in some numbers, so you use a bytevector.

The only thing I'm not sure about is whether guile supports encoding a
string (into a bytevector) in some other format than UTF-8, so if
no such procedures exist I would suggest adding a string-to-bytevector
encoder which takes an encoding, along with the encoders themselves (or
just procedures which convert the string directly into a bytevector in a
specific encoding).

WDYT?

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-30 10:18         ` Daniel Krueger
@ 2012-04-30 12:21           ` Eli Zaretskii
  2012-05-03 22:34           ` Ludovic Courtès
  1 sibling, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2012-04-30 12:21 UTC (permalink / raw)
  To: Daniel Krueger; +Cc: guile-user, ttn, sunjoong

> Date: Mon, 30 Apr 2012 12:18:59 +0200
> From: Daniel Krueger <keenbug@googlemail.com>
> Cc: ttn@gnuvola.org, guile-user@gnu.org, sunjoong@gmail.com
> 
> I think strings should be encoding `independent', so you don't have to
> mind that if you don't need to, and if you're working with a special
> encoding you're working on a representation of the `text' as a number
> of characters encoded in some numbers, so you use a bytevector.

That would do, I think.

> The only thing I'm not sure about is whether guile supports encoding a
> string (into a bytevector) in some other format than UTF-8, so if
> no such procedures exist I would suggest adding a string-to-bytevector
> encoder which takes an encoding, along with the encoders themselves (or
> just procedures which convert the string directly into a bytevector in a
> specific encoding).
> 
> WDYT?

Sounds like a plan to me ;-)

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-27 21:13 I'm looking for a method of converting a string's character encoding Sunjoong Lee
  2012-04-28  1:40 ` Sunjoong Lee
  2012-04-28 16:38 ` Sunjoong Lee
@ 2012-05-02  3:57 ` Daniel Hartwig
  2012-05-03  5:14 ` Sunjoong Lee
  2012-05-03 22:31 ` Ludovic Courtès
  4 siblings, 0 replies; 15+ messages in thread
From: Daniel Hartwig @ 2012-05-02  3:57 UTC (permalink / raw)
  To: guile-user

On 28 April 2012 05:13, Sunjoong Lee <sunjoong@gmail.com> wrote:
>
> Background:
> The #:decode-body? keyword of http-get seems not to work properly; I have to
> set #:decode-body? to a false value and decode the body manually. If a web
> page's charset is utf-8, there is no problem. If not, a problem occurs:
> decode-response-body of (web client) calls decode-string with the web page's
> charset, but the real charset of the bytevector is iso-8859-1, not the web
> page's charset. In that case you should not let http-get use
> decode-response-body.

Hello

It seems you later made some headway on this, but just a note to clarify:

Bytevectors are raw data, they do not have an encoding.  Web ports are
set to ISO-8859-1 as this is an 8-bit encoding that can be read as raw
data.  The output of http-get with '#:decode-body? #f' *should* be a
bytevector of exactly the bytes sent by the server.
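
The "one character is one byte" property can be checked with a small sketch:
decoding arbitrary bytes as ISO-8859-1 yields characters whose code points
equal the byte values. The variable names below are made up for the
illustration.

```scheme
(use-modules (rnrs bytevectors)   ; bytevector->u8-list
             (rnrs io ports))     ; open-bytevector-input-port, get-string-all

;; Arbitrary raw bytes, including values above 127.
(define raw #vu8(0 65 127 128 200 255))

;; Read them through a port whose encoding is ISO-8859-1.
(define port (open-bytevector-input-port raw))
(set-port-encoding! port "ISO-8859-1")
(define text (get-string-all port))
(close-port port)

;; Each character's code point is the value of the byte it came from.
(define codes (map char->integer (string->list text)))
```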

This is mentioned in the comments for read-request:

 > (use-modules (web request))
 > ,d read-request
 Read an HTTP request from @var{port}, optionally attaching the given
 metadata, @var{meta}.

 As a side effect, sets the encoding on @var{port} to
 ISO-8859-1 (latin-1), so that reading one character reads one byte.  See
 the discussion of character sets in "HTTP Requests" in the manual, for
 more information.

Can you provide us with a couple of sites where http-get or
decode-string does not work properly?  Or was something else at play
here?  This would help to investigate what the issue is.  (I am lazy
today to find some, I think you must know of a few :-)

>
> After getting the response body as a bytevector, you should decode it as
> "iso-8859-1", the way decode-string does. Then you get the web page's body
> as a string whose charset is the one you see in the response header.
>

Note that ISO-8859-1 does not cover much of Unicode so decoding the
bytevector as that will lose much data.

Regards

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-27 21:13 I'm looking for a method of converting a string's character encoding Sunjoong Lee
                   ` (2 preceding siblings ...)
  2012-05-02  3:57 ` Daniel Hartwig
@ 2012-05-03  5:14 ` Sunjoong Lee
  2012-05-03 22:31 ` Ludovic Courtès
  4 siblings, 0 replies; 15+ messages in thread
From: Sunjoong Lee @ 2012-05-03  5:14 UTC (permalink / raw)
  To: Daniel Hartwig; +Cc: guile-user

Hi, Daniel;

2012/4/28 Daniel Hartwig <mandyke@gmail.com>
>
> Can you provide us with a couple of sites where http-get or
> decode-string does not work properly?  Or was something else at play
> here?  This would help to investigate what the issue is.  (I am lazy
> today to find some, I think you must know of a few :-)
>

To cut a long story short, http-get is innocent; I had misunderstood it,
sorry.

I had summarized some issues of http-get in the post,
http://lists.gnu.org/archive/html/guile-user/2012-05/msg00005.html . With
"Example 1 - working", my problem was resolved mostly. "Example 3 - not
working" is an unsolved issue but is not an encoding problem; I think it
may be a design issue of declare-uri-header! or string->uri.

Thank you for comments.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-27 21:13 I'm looking for a method of converting a string's character encoding Sunjoong Lee
                   ` (3 preceding siblings ...)
  2012-05-03  5:14 ` Sunjoong Lee
@ 2012-05-03 22:31 ` Ludovic Courtès
  4 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2012-05-03 22:31 UTC (permalink / raw)
  To: guile-user

Hi,

Sunjoong Lee <sunjoong@gmail.com> skribis:

> I'm looking for a method of converting a string's character encoding from a
> certain codeset to utf-8. I know that Guile strings use utf-8 and that (read
> (open-bytevector-input-port (string->utf8 "hello"))) reads hello back. But
> what if the text "hello" is encoded in something other than utf-8 and you
> want a utf-8 converted string? What I want is something like iconv.

Ports in Guile are both binary and textual.  This allows for things like:

  scheme@(guile-user)> (use-modules (rnrs io ports))
  scheme@(guile-user)> (define (string->enc s e)
                         (let ((p (with-fluids ((%default-port-encoding e))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->enc "hello" "UTF-16BE")
  $1 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (string->enc "hello" "ISO-8859-3")
  $2 = #vu8(104 101 108 108 111)
  scheme@(guile-user)> (use-modules (rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $1)
  $3 = "hello"

You may also want to look at ‘string->pointer’ in (system foreign).
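
A variant of string->enc, as a sketch only: it goes through a bytevector
output port instead of a string port, so it does not depend on how string
ports treat %default-port-encoding. string->encoded-bytes is a hypothetical
name, and the encoding names accepted depend on the iconv available on your
system.

```scheme
(use-modules (rnrs bytevectors)   ; utf16->string
             (rnrs io ports))     ; open-bytevector-output-port

;; Hypothetical helper: write STR to a bytevector output port whose
;; encoding is ENCODING and return the accumulated bytes.
(define (string->encoded-bytes str encoding)
  (call-with-values open-bytevector-output-port
    (lambda (port get-bytes)
      (set-port-encoding! port encoding)
      (display str port)
      (force-output port)   ; make sure buffered output reaches the bytevector
      (get-bytes))))

(define utf16-bytes  (string->encoded-bytes "hello" "UTF-16BE"))
(define latin3-bytes (string->encoded-bytes "hello" "ISO-8859-3"))
(define round-trip   (utf16->string utf16-bytes))
```

The two bytevectors should match $1 and $2 in the transcript above, and
round-trip should be "hello" again.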

Does it answer your question?

Thanks,
Ludo’.

* Re: I'm looking for a method of converting a string's character encoding
  2012-04-30 10:18         ` Daniel Krueger
  2012-04-30 12:21           ` Eli Zaretskii
@ 2012-05-03 22:34           ` Ludovic Courtès
  1 sibling, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2012-05-03 22:34 UTC (permalink / raw)
  To: guile-user

Hi,

Daniel Krueger <keenbug@googlemail.com> skribis:

> The only thing I'm not sure about is whether guile supports encoding a
> string (into a bytevector) in some other format than UTF-8

It does, by virtue of mixed binary/textual ports (see my previous
message.)

Thanks,
Ludo’.