iconv or something like that

unofficial mirror of guile-user@gnu.org 

* iconv or something like that
From: Konrad Makowski @ 2014-10-23 11:31 UTC
  To: Guile User

Is there a way to convert text from one character encoding to another?
I have a database in ISO-8859-2, but my script runs in UTF-8. I use the
dbi module.

-- 
Konrad


* Re: iconv or something like that
From: Mark H Weaver @ 2014-10-23 18:00 UTC
  To: Konrad Makowski; +Cc: Guile User

Konrad Makowski <poczta@konradmakowski.pl> writes:
> Is there a way to convert text from one character encoding to another?

Yes, but character encodings are only relevant when converting between a
sequence of _bytes_ (a bytevector), and a sequence of _characters_ [*]
(a string).  These conversions happen implicitly while performing I/O,
converting Scheme strings to/from C, etc.

[*] More precisely, Scheme strings are sequences of Unicode code points.

It doesn't make sense to talk about the encoding of a Scheme string, or
to convert a Scheme string from one encoding to another, because they
are not byte sequences.
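
A minimal sketch of this distinction, assuming a Guile version that
provides the (ice-9 iconv) module. The string and the variable names are
chosen only for illustration. The same string yields different byte
sequences depending on the encoding used to encode it:

```scheme
(use-modules (ice-9 iconv))   ; string->bytevector, bytevector->string

;; A Scheme string is a sequence of code points; #\x141 is U+0141 (Ł).
(define name (string #\P #\A #\W #\E #\x141))

;; Encoding turns the string into bytes, and the bytes depend on the encoding.
(define latin2-bytes (string->bytevector name "ISO-8859-2"))
(define utf8-bytes   (string->bytevector name "UTF-8"))

(display latin2-bytes) (newline)   ; #vu8(80 65 87 69 163)
(display utf8-bytes)   (newline)   ; #vu8(80 65 87 69 197 129)

;; Decoding with the matching encoding gives back the same string.
(display (string=? name (bytevector->string latin2-bytes "ISO-8859-2")))
(newline)                          ; #t
```

The encoding names appear only at the two conversion steps; the string
itself carries none.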

It sounds like you already have a Scheme string that was incorrectly
decoded from bytes, and are asking how to fix it up.  Unfortunately,
this won't work, because many valid ISO-8859-2 byte sequences are not
valid UTF-8, and will therefore lead to decoding errors.

> I have a database in ISO-8859-2, but my script runs in UTF-8. I use the
> dbi module.

Having looked at the guile-dbi source code, I see that it always uses
the current locale encoding when talking to databases.  Specifically, it
always uses 'scm_from_locale_string' and 'scm_to_locale_string'.  For
your purposes, you'd like it to use 'scm_from_stringn' and
'scm_to_stringn' instead, with "ISO-8859-2" as the 'encoding' argument.

My knowledge of modern databases is limited, so I'm not sure how this
problem is normally dealt with.  It seems to me that, ideally, strings
in databases should be sequences of Unicode code points, rather than
sequences of bytes.  If that were the case, then this problem wouldn't
arise.

It would be good if someone with more knowledge of databases would chime
in here.

In the meantime, I can see a few possible solutions/workarounds:

* Enhance guile-dbi by adding an 'encoding' field to its database
  handles, add a new API procedure to set it, and use it in all the
  appropriate places.  This only makes sense if database strings are
  conceptually byte sequences; otherwise it should probably be fixed in
  some other way.

* Hack your local copy of guile-dbi to use 'scm_from_stringn' and
  'scm_to_stringn' with a hard-coded "ISO-8859-2" in the appropriate
  places.

* Use 'setlocale' to set an ISO-8859-2 locale temporarily while
  performing database queries.
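
A rough sketch of the third workaround, not a tested recipe for
guile-dbi. The helper name 'call-with-locale' is hypothetical, and the
locale name in the usage comment is an assumption: it has to be
installed on the system, otherwise 'setlocale' signals an error.

```scheme
;; call-with-locale is a hypothetical helper, not part of Guile or guile-dbi.
(define (call-with-locale locale thunk)
  (let ((saved (setlocale LC_ALL)))          ; one argument: query only
    (dynamic-wind
      (lambda () (setlocale LC_ALL locale))  ; switch on entry
      thunk
      (lambda () (setlocale LC_ALL saved))))) ; restore on exit, even on error

;; Usage sketch; the locale name and run-database-queries are assumptions:
;; (call-with-locale "pl_PL.ISO-8859-2"
;;   (lambda () (run-database-queries)))
```

Using 'dynamic-wind' means the previous locale is restored even if a
query throws, so the rest of the script keeps running in UTF-8.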

Which database are you using?

     Mark


* Re: iconv or something like that
From: Greg Troxel @ 2014-10-23 18:07 UTC
  To: Mark H Weaver; +Cc: Guile User

Mark H Weaver <mhw@netris.org> writes:

> It would be good if someone with more knowledge of databases would chime
> in here.

I'm pretty fuzzy on the details, but PostgreSQL can handle the concept
of varying encodings and converting on the fly, so that you can tell a
session that you want a particular encoding.  So if Guile is set up to
always use Unicode, then presumably the db->Guile conversions should
just do that.

http://www.postgresql.org/docs/current/static/multibyte.html


* Re: iconv or something like that
From: Konrad Makowski @ 2014-10-25  7:03 UTC
  To: Mark H Weaver; +Cc: Guile User

I'm using MySQL. I figured out that if I send the query "SET NAMES utf8"
or "SET NAMES utf8 COLLATE utf8_general_ci" to the database (in a
terminal, for example), MySQL converts the charset of the returned data
for me. But if I do the same in my Guile script, it reports an error:
In ice-9/boot-9.scm:
  157: 9 [catch #t #<catch-closure 1cff400> ...]
In unknown file:
    ?: 8 [apply-smob/1 #<catch-closure 1cff400>]
In ice-9/boot-9.scm:
   63: 7 [call-with-prompt prompt0 ...]
In ice-9/eval.scm:
  432: 6 [eval # #]
  432: 5 [eval # #]
  387: 4 [eval # #]
  387: 3 [eval # #]
  387: 2 [eval # #]
  387: 1 [eval # #]
In unknown file:
    ?: 0 [utf8->string #vu8(80 65 87 69 163)]

ERROR: In procedure utf8->string:
ERROR: Throw to key `decoding-error' with args `("scm_from_stringn" 
"input locale conversion error" 84 #vu8(80 65 87 69 163))'.
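
The bytes in this error look like ISO-8859-2 text rather than UTF-8:
byte 163 (#xA3) is a UTF-8 continuation byte, which cannot occur without
a preceding lead byte. A small check, assuming the (ice-9 iconv) module
is available, with variable names chosen only for illustration:

```scheme
(use-modules (ice-9 iconv))

;; The bytes reported in the decoding-error above.
(define bytes #vu8(80 65 87 69 163))

;; Interpreted as ISO-8859-2, byte 163 (#xA3) is U+0141 (Ł).
(define decoded (bytevector->string bytes "ISO-8859-2"))

(display (string-length decoded)) (newline)                  ; 5
(display (char->integer (string-ref decoded 4))) (newline)   ; 321, i.e. #x141
```

So the data being decoded was still ISO-8859-2 at the point where
'utf8->string' was applied to it.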

My locale settings are:
LANG=pl_PL.UTF-8
LANGUAGE=pl:en
LC_CTYPE="pl_PL.UTF-8"
LC_NUMERIC="pl_PL.UTF-8"
LC_TIME="pl_PL.UTF-8"
LC_COLLATE="pl_PL.UTF-8"
LC_MONETARY="pl_PL.UTF-8"
LC_MESSAGES="pl_PL.UTF-8"
LC_PAPER="pl_PL.UTF-8"
LC_NAME="pl_PL.UTF-8"
LC_ADDRESS="pl_PL.UTF-8"
LC_TELEPHONE="pl_PL.UTF-8"
LC_MEASUREMENT="pl_PL.UTF-8"
LC_IDENTIFICATION="pl_PL.UTF-8"
LC_ALL=pl_PL.UTF-8

Any ideas?

Konrad


* Re: iconv or something like that
From: Konrad Makowski @ 2014-10-25  8:24 UTC
  To: guile-user

OK, the problem is resolved; it was not related to MySQL or the locale,
but was my own mistake.

Konrad


* Re: iconv or something like that
From: Thien-Thi Nguyen @ 2014-10-25 18:51 UTC
  To: Konrad Makowski; +Cc: guile-user

() Konrad Makowski <poczta@konradmakowski.pl>
() Sat, 25 Oct 2014 10:24:34 +0200

   OK, the problem is resolved; it was not related to
   MySQL or the locale, but was my own mistake.

If you explain the mistake now, maybe it will be harder
for others (and perhaps yourself) to make it, later.

-- 
Thien-Thi Nguyen
   GPG key: 4C807502
   (if you're human and you know it)
      read my lisp: (responsep (questions 'technical)
                               (not (via 'mailing-list)))
                     => nil


* Re: iconv or something like that
From: Konrad Makowski @ 2014-10-25 20:25 UTC
  To: guile-user

I know what you mean, but this was a silly mistake. I had been using my
own procedure

  (define (iconv str from to)
    (bytevector->string (string->bytevector str from) to))

to try to convert the charset of a string. After that I figured out that
MySQL can convert the charset itself with "SET NAMES utf8". The error
occurred because I had not removed this wrong procedure from my code.
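
A small reproduction of why such a procedure fails, assuming the
(ice-9 iconv) module is available. The procedure is renamed to 'recode'
here, and the test string is made up for illustration:

```scheme
(use-modules (ice-9 iconv))

;; The procedure described above, under a different name.
(define (recode str from to)
  (bytevector->string (string->bytevector str from) to))

;; #\x141 is U+0141 (Ł), written as an escape to keep the source ASCII.
(define name (string #\P #\A #\W #\E #\x141))

;; Encoding and decoding with the same encoding is a harmless round trip.
(display (string=? name (recode name "ISO-8859-2" "ISO-8859-2")))
(newline)

;; With two different encodings, the second step decodes ISO-8859-2 bytes
;; as UTF-8. Byte 163 is not valid UTF-8 on its own, so an error is thrown;
;; the handler returns the key of that error.
(define outcome
  (catch #t
    (lambda () (recode name "ISO-8859-2" "UTF-8"))
    (lambda (key . args) key)))

(display outcome) (newline)
```

In the backtrace earlier in the thread, the key thrown in this situation
is 'decoding-error'. Once the database returns UTF-8 and guile-dbi has
decoded it into a Scheme string, no further conversion is needed.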

Konrad

On 25.10.2014 at 20:51, Thien-Thi Nguyen wrote:
> () Konrad Makowski <poczta@konradmakowski.pl>
> () Sat, 25 Oct 2014 10:24:34 +0200
>
>     OK, the problem is resolved; it was not related to
>     MySQL or the locale, but was my own mistake.
>
> If you explain the mistake now, maybe it will be harder
> for others (and perhaps yourself) to make it, later.
>


* Re: iconv or something like that
From: Thien-Thi Nguyen @ 2014-10-26  9:50 UTC
  To: Konrad Makowski; +Cc: guile-user

() Konrad Makowski <poczta@konradmakowski.pl>
() Sat, 25 Oct 2014 22:25:26 +0200

   I know what you mean, but this was a silly mistake.

No worries.

   [missed decrufting opportunity]

Thanks.  I think the more the Free Software world honestly
shares its mistakes, silly or what have you, the lighter
(and, curiously, stronger) the atmosphere becomes.

Of course, if you never intend the software you author to
be Free, then that's a different discussion altogether...

-- 
Thien-Thi Nguyen
   GPG key: 4C807502
   (if you're human and you know it)
      read my lisp: (responsep (questions 'technical)
                               (not (via 'mailing-list)))
                     => nil


* Re: iconv or something like that
From: Barry Schwartz @ 2014-10-26 16:21 UTC
  To: guile-user

Thien-Thi Nguyen <ttn@gnu.org> skribis:
> Thanks.  I think the more the Free Software world honestly
> shares its mistakes, silly or what have you, the lighter
> (and, curiously, stronger) the atmosphere becomes.

That’s why I generally dislike it when people remove commits/branches
in version control software that allows such behavior. You are
removing the history of errors, false starts, etc.


