From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Sunjoong Lee <sunjoong@gmail.com>
Newsgroups: gmane.lisp.guile.user
Subject: Re: I'm looking for a method of converting a string's character
	encoding
Date: Sun, 29 Apr 2012 07:42:29 +0900
Message-ID: <CAK93xhoGxhyiFawgHvgDw6ZXuQiaVXtZnQoOqwPjmgKy7u72sg@mail.gmail.com>
References: <CAK93xhpSbFQM-FuPAhQVKJgtZHZKO7b2yO--ciNZPenQA_3xkg@mail.gmail.com>
	<CAK93xhoF=6Kf5hunyzhHDr_Ain_i54VKgERhQV6eO+Q9-cnoCw@mail.gmail.com>
	<87obqbwykh.fsf@gnuvola.org>
	<CAAh5vOP=ZDsPMSxV49r6uxomr6D3D2-8wP-9OcMQrCzWTE5Sew@mail.gmail.com>
	<834ns37f0b.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d04182582df065704bec4f0cd
X-Trace: dough.gmane.org 1335652986 13848 80.91.229.3 (28 Apr 2012 22:43:06 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Sat, 28 Apr 2012 22:43:06 +0000 (UTC)
Cc: guile-user@gnu.org, Daniel Krueger <keenbug@googlemail.com>,
	ttn@gnuvola.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Sun Apr 29 00:43:05 2012
Return-path: <guile-user-bounces+guile-user=m.gmane.org@gnu.org>
Envelope-to: guile-user@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-user-bounces+guile-user=m.gmane.org@gnu.org>)
	id 1SOGM3-0000J3-5G
	for guile-user@m.gmane.org; Sun, 29 Apr 2012 00:43:03 +0200
Original-Received: from localhost ([::1]:58045 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-user-bounces+guile-user=m.gmane.org@gnu.org>)
	id 1SOGM2-0001GZ-Eh
	for guile-user@m.gmane.org; Sat, 28 Apr 2012 18:43:02 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:36604)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <sunjoong@gmail.com>) id 1SOGLw-0001GI-6f
	for guile-user@gnu.org; Sat, 28 Apr 2012 18:42:58 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <sunjoong@gmail.com>) id 1SOGLu-0006Jd-2F
	for guile-user@gnu.org; Sat, 28 Apr 2012 18:42:55 -0400
Original-Received: from mail-wi0-f177.google.com ([209.85.212.177]:58116)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <sunjoong@gmail.com>)
	id 1SOGLt-0006JQ-Mo; Sat, 28 Apr 2012 18:42:54 -0400
Original-Received: by wibhj13 with SMTP id hj13so1357399wib.12
	for <multiple recipients>; Sat, 28 Apr 2012 15:42:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:from:date:message-id:subject:to
	:cc:content-type;
	bh=seZ1EcuFaevApsuDSJ8g1Bs0TTw+cfs+AwTgmU/BD5c=;
	b=VmxxyG6r1qyRDf1DbQMAFyQVuDYkk/vVCuldbb1xhbag8AWJ+R1e+iTiwRNe8yzHLr
	KpPJhi2aqJLAbXDRawGshcC16u03gx3d+B8kf2iZWB1c4ZtWlxQqrFTlcSUDSSqZu7nt
	7qIWSo4nf27djCS0pB0nAYtJzL04/7j7Rv8hYKWESYWg6vAlYOrxqZB4cfvgkQzlZWdv
	zMHRlv29BvKpiMEWdTX1oE3KRB5W1xKKdEjT+eVaj9q7pLDKRHvtAuCu2qg2u4917Q4Z
	HitNx6QbtSGUWqJbA2n7T7Dl3TLeM259MaV+H1TbnddAHLbhSZ5BZ1PxHyTmjwcuayiP
	Fybw==
Original-Received: by 10.180.105.194 with SMTP id go2mr8646613wib.22.1335652970857;
	Sat, 28 Apr 2012 15:42:50 -0700 (PDT)
Original-Received: by 10.223.93.206 with HTTP; Sat, 28 Apr 2012 15:42:29 -0700 (PDT)
In-Reply-To: <834ns37f0b.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 209.85.212.177
X-BeenThere: guile-user@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: General Guile related discussions <guile-user.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-user>
List-Post: <mailto:guile-user@gnu.org>
List-Help: <mailto:guile-user-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=subscribe>
Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Original-Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.user:9420
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.user/9420>

--f46d04182582df065704bec4f0cd
Content-Type: text/plain; charset=UTF-8

Thanks, Thien-Thi, Daniel and Eli.

Eli pointed out a good example; I'll add another one. In countries whose
characters are encoded in multiple bytes, such as China, Japan and Korea
(the CJK countries), converting between codesets is a common issue. In
Korea, one web page may be written in EUC-KR and another in UTF-8. In Japan,
pages use Shift-JIS, EUC-JP, ISO-2022-JP and UTF-8. In China, GBK, GB18030,
Big5, Big5-HKSCS and UTF-8. That is, Koreans use 2 different codesets on
the net, Japanese 4 and Chinese 5.

A single program would rarely need to compare a Chinese web page with a
Korean one, but... suppose you want to write a program that monitors web
pages: it would need a codeset converter. Is this only a CJK issue? Greeks
use 3 codesets, Vietnamese 2, Arabs 3, and so on. Russians, too, seem to
use as many codesets as the Chinese.
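
To make the two use cases concrete, here is a minimal sketch. It is written
in Python rather than Guile only because its codec names are easy to check;
the function names transcode and redecode are hypothetical, chosen for this
illustration, and nothing here is a proposal for Guile's API. The first
function converts page bytes from one codeset to another; the second
covers Eli's case of a string that was decoded with the wrong codeset.

```python
def transcode(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode raw bytes from the source codeset and re-encode in the target."""
    return raw.decode(source).encode(target)


def redecode(text: str, wrong: str, right: str) -> str:
    """Recover text that was decoded with the wrong codeset.

    This only works when the wrong codec maps every byte reversibly,
    as latin-1 does; a lossy wrong decoding cannot be undone.
    """
    return text.encode(wrong).decode(right)


korean = "한글"

# Bytes as a page written in EUC-KR would deliver them.
euc_kr_page = korean.encode("euc-kr")

# Use case 1: a page monitor normalizes everything to UTF-8.
utf8_page = transcode(euc_kr_page, "euc-kr")

# Use case 2: the page was wrongly taken as Latin-1 and only the
# in-memory string is left, so it has to be re-decoded.
mojibake = euc_kr_page.decode("latin-1")
recovered = redecode(mojibake, "latin-1", "euc-kr")
```

The second case is the one that needs an in-memory conversion: the original
bytes are gone, so the program has to turn the string back into bytes before
it can decode them correctly.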

2012/4/29 Eli Zaretskii <eliz@gnu.org>

> > Date: Sat, 28 Apr 2012 20:29:22 +0200
> > From: Daniel Krueger <keenbug@googlemail.com>
> > Cc: guile-user@gnu.org, Sunjoong Lee <sunjoong@gmail.com>
> >
> > i think there shouldn't be any transcoding of guile's strings, as
> > strings are internal representation of characters, no matter how they
> > are encoded. So the only time when encoding matters is when it passes
> > it's `internal boundarys', i mean if you write the string to a port or
> > read from a port or pass it as a string to a foreign library. For the
> > ports all transcoding is available, and as said, the real
> > representation of guile strings internally is as utf8, which can't be
> > changed. The only additional thing i forgot about are bytevectors, if
> > you convert a string to an explicit representation, but afaik there
> > you also can give the encoding to use.
> >
> > Am I wrong?
>
> You are mostly right, but only "mostly".  Experience teaches that
> sometimes you need to change encoding even inside "the boundaries".
> One notable example is when the original encoding was determined
> incorrectly, and the application wants to "re-decode" the string, when
> its external origin is no longer available.  Another example is an
> application that wants to convert an encoded string into base-64 (or
> similar) form -- you'll need to encode the string internally first.
>
> These kinds of rare, but still important, use cases are the reason why
> Emacs Lisp has primitives to do encoding and decoding of in-memory
> strings; as much as Emacs maintainers want to get rid of the related
> need to support "unibyte strings", they are not going to go away any
> time soon.
>
> IOW, Guile needs a way to represent a string encoded in something
> other than UTF-8, and convert between UTF-8 and other encodings.
>

--f46d04182582df065704bec4f0cd--