From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Sunjoong Lee
Newsgroups: gmane.lisp.guile.user
Subject: Re: I'm looking for a method of converting a string's character encoding
Date: Sun, 29 Apr 2012 07:42:29 +0900
Message-ID:
References: <87obqbwykh.fsf@gnuvola.org> <834ns37f0b.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d04182582df065704bec4f0cd
X-Trace: dough.gmane.org 1335652986 13848 80.91.229.3 (28 Apr 2012 22:43:06 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Sat, 28 Apr 2012 22:43:06 +0000 (UTC)
Cc: guile-user@gnu.org, Daniel Krueger , ttn@gnuvola.org
To: Eli Zaretskii
Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Sun Apr 29 00:43:05 2012
Return-path:
Envelope-to: guile-user@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SOGM3-0000J3-5G for guile-user@m.gmane.org; Sun, 29 Apr 2012 00:43:03 +0200
Original-Received: from localhost ([::1]:58045 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1SOGM2-0001GZ-Eh for guile-user@m.gmane.org; Sat, 28 Apr 2012 18:43:02 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:36604) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1SOGLw-0001GI-6f for guile-user@gnu.org; Sat, 28 Apr 2012 18:42:58 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1SOGLu-0006Jd-2F for guile-user@gnu.org; Sat, 28 Apr 2012 18:42:55 -0400
Original-Received: from mail-wi0-f177.google.com ([209.85.212.177]:58116) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1SOGLt-0006JQ-Mo; Sat, 28 Apr 2012 18:42:54 -0400
Original-Received: by wibhj13 with SMTP id hj13so1357399wib.12 for ; Sat, 28 Apr 2012 15:42:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
    h=mime-version:in-reply-to:references:from:date:message-id:subject:to
     :cc:content-type;
    bh=seZ1EcuFaevApsuDSJ8g1Bs0TTw+cfs+AwTgmU/BD5c=;
    b=VmxxyG6r1qyRDf1DbQMAFyQVuDYkk/vVCuldbb1xhbag8AWJ+R1e+iTiwRNe8yzHLr
     KpPJhi2aqJLAbXDRawGshcC16u03gx3d+B8kf2iZWB1c4ZtWlxQqrFTlcSUDSSqZu7nt
     7qIWSo4nf27djCS0pB0nAYtJzL04/7j7Rv8hYKWESYWg6vAlYOrxqZB4cfvgkQzlZWdv
     zMHRlv29BvKpiMEWdTX1oE3KRB5W1xKKdEjT+eVaj9q7pLDKRHvtAuCu2qg2u4917Q4Z
     HitNx6QbtSGUWqJbA2n7T7Dl3TLeM259MaV+H1TbnddAHLbhSZ5BZ1PxHyTmjwcuayiP
     Fybw==
Original-Received: by 10.180.105.194 with SMTP id go2mr8646613wib.22.1335652970857; Sat, 28 Apr 2012 15:42:50 -0700 (PDT)
Original-Received: by 10.223.93.206 with HTTP; Sat, 28 Apr 2012 15:42:29 -0700 (PDT)
In-Reply-To: <834ns37f0b.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized.
X-Received-From: 209.85.212.177
X-BeenThere: guile-user@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: General Guile related discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Original-Sender: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.user:9420
Archived-At:

--f46d04182582df065704bec4f0cd
Content-Type: text/plain; charset=UTF-8

Thanks, Thien-Thi, Daniel and Eli.

Eli gave a good example; I'll add another. In countries whose languages
need multibyte character encodings, such as China, Japan and Korea (the
CJK countries), converting between codesets is a common problem. In
Korea, one web page may be written in EUC-KR and another in UTF-8. In
Japan, pages use Shift-JIS, EUC-JP, ISO-2022-JP and UTF-8. In China,
GBK, GB18030, Big5, Big5-HKSCS and UTF-8. In other words, Koreans use 2
different codesets on the net, Japanese 4 and Chinese 5.

A single program will rarely need to compare a Chinese web page with a
Korean one, but suppose you want to write a program that monitors web
pages: it would need a codeset converter. And is this only a CJK issue?
Greeks use 3 codesets, Vietnamese 2, Arabs 3, and so on. Russians seem
to use as many codesets as the Chinese.

2012/4/29 Eli Zaretskii <eliz@gnu.org>

> > Date: Sat, 28 Apr 2012 20:29:22 +0200
> > From: Daniel Krueger <keenbug@googlemail.com>
> > Cc: guile-user@gnu.org, Sunjoong Lee <sunjoong@gmail.com>
> >
> > i think there shouldn't be any transcoding of guile's strings, as
> > strings are internal representation of characters, no matter how they
> > are encoded. So the only time when encoding matters is when it passes
> > it's `internal boundarys', i mean if you write the string to a port or
> > read from a port or pass it as a string to a foreign library. For the
> > ports all transcoding is available, and as said, the real
> > representation of guile strings internally is as utf8, which can't be
> > changed. The only additional thing i forgot about are bytevectors, if
> > you convert a string to an explicit representation, but afaik there
> > you also can give the encoding to use.
> >
> > Am I wrong?
>
> You are mostly right, but only "mostly".  Experience teaches that
> sometimes you need to change encoding even inside "the boundaries".
> One notable example is when the original encoding was determined
> incorrectly, and the application wants to "re-decode" the string, when
> its external origin is no longer available.  Another example is an
> application that wants to convert an encoded string into base-64 (or
> similar) form -- you'll need to encode the string internally first.
>
> These kinds of rare, but still important, use cases are the reason why
> Emacs Lisp has primitives to do encoding and decoding of in-memory
> strings; as much as Emacs maintainers want to get rid of the related
> need to support "unibyte strings", they are not going to go away any
> time soon.
>
> IOW, Guile needs a way to represent a string encoded in something
> other than UTF-8, and convert between UTF-8 and other encodings.
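
To make the cases above concrete (codeset conversion for a web page,
plus Eli's "re-decode" and base-64 examples), here is a minimal sketch.
It is written in Python rather than Guile, purely to illustrate the
idea; it is not a proposal for what a Guile API should look like. The
function names (transcode, redecode, to_base64) are hypothetical, and
the sample text and codesets are only assumptions for the demonstration.

```python
import base64


def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert encoded bytes from one codeset to another.

    The bytes are decoded into a string first, then encoded again.
    """
    return data.decode(source).encode(target)


def redecode(text: str, wrong: str, right: str) -> str:
    """Repair a string that was decoded with the wrong codeset.

    Encoding with the wrong codeset recovers the original bytes. This is
    lossless only if that codeset maps every byte to a character, as
    ISO-8859-1 does; it is an assumption of this sketch.
    """
    return text.encode(wrong).decode(right)


def to_base64(text: str, encoding: str) -> str:
    """Encode the string in the given codeset, then base-64 the bytes."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


if __name__ == "__main__":
    page = "한글".encode("euc-kr")        # bytes as an EUC-KR page would send them
    print(transcode(page, "euc-kr"))      # the same text as UTF-8 bytes
    garbled = page.decode("iso-8859-1")   # decoded with the wrong codeset
    print(redecode(garbled, "iso-8859-1", "euc-kr"))
    print(to_base64("한글", "euc-kr"))
```

The point of the sketch is that all three operations need an in-memory
step from string to encoded bytes or back, with no port involved.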
--f46d04182582df065704bec4f0cd--