decode-coding-string question

unofficial mirror of help-gnu-emacs@gnu.org

* decode-coding-string question
@ 2008-08-14 21:02 Ted Zlatanov
  2008-08-14 22:19 ` Dmitry Dzhus
  2008-08-14 22:20 ` David Golden
  0 siblings, 2 replies; 15+ messages in thread
From: Ted Zlatanov @ 2008-08-14 21:02 UTC (permalink / raw)
  To: help-gnu-emacs

This should decode to нуль but doesn't (I get the same string instead):

(decode-coding-string "íîëü" 'cp1251)

Am I missing something obvious?  Do I need to encode the string to
something else?

Ted



* Re: decode-coding-string question
  2008-08-14 21:02 decode-coding-string question Ted Zlatanov
@ 2008-08-14 22:19 ` Dmitry Dzhus
  2008-08-15  7:37   ` Eli Zaretskii
                     ` (2 more replies)
  2008-08-14 22:20 ` David Golden
  1 sibling, 3 replies; 15+ messages in thread
From: Dmitry Dzhus @ 2008-08-14 22:19 UTC (permalink / raw)
  To: help-gnu-emacs

Ted Zlatanov wrote:

> This should decode to нуль but doesn't (I get the same string instead):
>
> (decode-coding-string "íîëü" 'cp1251)
>
> Am I missing something obvious?  Do I need to encode the string to
> something else?

0. «íóëü», not «íîëü»

1. (decode-coding-string (string-make-unibyte "íóëü") 'cp1251)
-- 
Happy Hacking.

http://sphinx.net.ru
む



* Re: decode-coding-string question
  2008-08-14 21:02 decode-coding-string question Ted Zlatanov
  2008-08-14 22:19 ` Dmitry Dzhus
@ 2008-08-14 22:20 ` David Golden
  1 sibling, 0 replies; 15+ messages in thread
From: David Golden @ 2008-08-14 22:20 UTC (permalink / raw)
  To: help-gnu-emacs

Ted Zlatanov wrote:

> This should decode to нуль but doesn't (I get the same string
> instead):
> 
> (decode-coding-string "íîëü" 'cp1251)
> 
> Am I missing something obvious?  Do I need to encode the string to
> something else?
>
I'm guessing you're using a recent multibyte/Unicode Emacs, and I should
note that I don't fully understand Emacs's encoding handling, but you
probably do need to encode first. Try:

(decode-coding-string (encode-coding-string "íîëü" 'iso-8859-1) 'cp1251)






* Re: decode-coding-string question
  2008-08-14 22:19 ` Dmitry Dzhus
@ 2008-08-15  7:37   ` Eli Zaretskii
  2008-08-15 15:54   ` Ted Zlatanov
       [not found]   ` <mailman.16762.1218785885.18990.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2008-08-15  7:37 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Dmitry Dzhus <dima@sphinx.net.ru>
> Date: Fri, 15 Aug 2008 02:19:11 +0400
> 
> Ted Zlatanov wrote:
> 
> > This should decode to нуль but doesn't (I get the same string instead):
> >
> > (decode-coding-string "íîëü" 'cp1251)
> >
> > Am I missing something obvious?  Do I need to encode the string to
> > something else?
> 
> 0. «íóëü», not «íîëü»

That's not important, the original problem remains, even with a
different spelling of the word.

> 1. (decode-coding-string (string-make-unibyte "íóëü") 'cp1251)

Yes, that's it: decode-coding-string works reliably only on unibyte
strings.  In multibyte strings, random byte values will be interpreted
according to rules that are appropriate for Emacs internal
representation of text in buffers and strings, not according to what
we humans expect.
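
A minimal sketch of the two-step idea described above, assuming a
Unicode-based Emacs (23 or later). The helper name `my-redecode-string'
is hypothetical and not part of Emacs. It uses `encode-coding-string'
to get back to raw bytes rather than `string-make-unibyte', which, as
far as I know, is marked obsolete in newer releases.

```elisp
;; Sketch only: `my-redecode-string' is a hypothetical helper, not an
;; Emacs built-in.
(defun my-redecode-string (string wrong-coding right-coding)
  "Return STRING re-decoded as RIGHT-CODING.
STRING is multibyte text that was mistakenly decoded as WRONG-CODING."
  ;; Step 1: turn the characters back into the raw bytes they came
  ;; from.  `encode-coding-string' returns a unibyte string.
  (let ((bytes (encode-coding-string string wrong-coding)))
    ;; Step 2: decode those bytes with the coding system that was
    ;; actually meant.
    (decode-coding-string bytes right-coding)))

;; Usage:
;; (my-redecode-string "íóëü" 'iso-8859-1 'cp1251)  ; => "нуль"
```

The first step is what makes the difference: once the string is
unibyte, each byte is passed to the cp1251 decoder unchanged.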


* Re: decode-coding-string question
  2008-08-14 22:19 ` Dmitry Dzhus
  2008-08-15  7:37   ` Eli Zaretskii
@ 2008-08-15 15:54   ` Ted Zlatanov
  2008-08-15 17:04     ` Eli Zaretskii
       [not found]     ` <mailman.16821.1218819933.18990.help-gnu-emacs@gnu.org>
       [not found]   ` <mailman.16762.1218785885.18990.help-gnu-emacs@gnu.org>
  2 siblings, 2 replies; 15+ messages in thread
From: Ted Zlatanov @ 2008-08-15 15:54 UTC (permalink / raw)
  To: help-gnu-emacs

On Fri, 15 Aug 2008 02:19:11 +0400 Dmitry Dzhus <dima@sphinx.net.ru> wrote: 

DD> (decode-coding-string (string-make-unibyte "íóëü") 'cp1251)

Thanks.

There should probably be a specific function for this:

(decode-coding-string-as-unibyte "íóëü" 'cp1251)

ditto for decode-coding-region.  Should I add it or is that not
generally useful?  A flag is not as good because both functions have
several flags already.

Ted



* Re: decode-coding-string question
       [not found]   ` <mailman.16762.1218785885.18990.help-gnu-emacs@gnu.org>
@ 2008-08-15 16:06     ` Dmitry Dzhus
  0 siblings, 0 replies; 15+ messages in thread
From: Dmitry Dzhus @ 2008-08-15 16:06 UTC (permalink / raw)
  To: help-gnu-emacs

Eli Zaretskii wrote:

>> From: Dmitry Dzhus <dima@sphinx.net.ru>
>> Date: Fri, 15 Aug 2008 02:19:11 +0400
>> 
>> Ted Zlatanov wrote:
>> 
>> > This should decode to нуль but doesn't (I get the same string instead):
>> >
>> > (decode-coding-string "íîëü" 'cp1251)
>> >
>> > Am I missing something obvious?  Do I need to encode the string to
>> > something else?
>> 
>> 0. «íóëü», not «íîëü»
>
> That's not important, the original problem remains, even with a
> different spelling of the word.

That was a nitpick, somewhat irrelevant to the unibyte/multibyte problem:
Ted expected to get «нуль», which corresponds to «íóëü», whereas the
string he originally provided, «íîëü», decodes to «ноль». Both «нуль»
and «ноль» mean «zero», though.
-- 
Happy Hacking.

http://sphinx.net.ru
む



* Re: decode-coding-string question
  2008-08-15 15:54   ` Ted Zlatanov
@ 2008-08-15 17:04     ` Eli Zaretskii
       [not found]     ` <mailman.16821.1218819933.18990.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2008-08-15 17:04 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Ted Zlatanov <tzz@lifelogs.com>
> Date: Fri, 15 Aug 2008 10:54:20 -0500
> 
> There should probably be a specific function for this:
> 
> (decode-coding-string-as-unibyte "íóëü" 'cp1251)
> 
> ditto for decode-coding-region.  Should I add it or is that not
> generally useful?

Personally, I think it's not useful, since decode-coding-region and
decode-coding-string are used only on unibyte text.  But feel free to
raise this on emacs-devel.






* Re: decode-coding-string question
       [not found]     ` <mailman.16821.1218819933.18990.help-gnu-emacs@gnu.org>
@ 2008-08-18 13:58       ` Ted Zlatanov
  2008-08-18 16:09         ` Nikolaj Schumacher
                           ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Ted Zlatanov @ 2008-08-18 13:58 UTC (permalink / raw)
  To: help-gnu-emacs

On Fri, 15 Aug 2008 20:04:42 +0300 Eli Zaretskii <eliz@gnu.org> wrote: 

>> From: Ted Zlatanov <tzz@lifelogs.com>
>> Date: Fri, 15 Aug 2008 10:54:20 -0500
>> 
>> There should probably be a specific function for this:
>> 
>> (decode-coding-string-as-unibyte "íóëü" 'cp1251)
>> 
>> ditto for decode-coding-region.  Should I add it or is that not
>> generally useful?

EZ> Personally, I think it's not useful, since decode-coding-region and
EZ> decode-coding-string are used only on unibyte text.  But feel free to
EZ> raise this on emacs-devel.

How would you recommend decoding text from particular encodings?  Given
text like the one shown above in a buffer, only decode-coding-region
seems to DTRT, and it's not interactive.

Context: I have a file full of CP1251 data and don't want to use Perl's
Encode module because I'm stubborn and think Emacs should handle it :)

On Fri, 15 Aug 2008 20:06:59 +0400 Dmitry Dzhus <dima@sphinx.net.ru> wrote: 

DD> That was nitpicking somewhat irrelevant to unibyte-multibyte problem:
DD> Ted expected to get «нуль», and it's «íóëü», though the string he
DD> originally provided — «íîëü» — decodes to «ноль»; however, both «нуль»
DD> and «ноль» mean «zero».

Yes, I was translating from Russian and knew the text said "zero" but
didn't remember the correct spelling.  Thanks for checking.

Ted


* Re: decode-coding-string question
  2008-08-18 13:58       ` Ted Zlatanov
@ 2008-08-18 16:09         ` Nikolaj Schumacher
  2008-08-18 17:45         ` David Golden
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Nikolaj Schumacher @ 2008-08-18 16:09 UTC (permalink / raw)
  To: Ted Zlatanov; +Cc: help-gnu-emacs

Ted Zlatanov <tzz@lifelogs.com> wrote:

> Context: I have a file full of CP1251 data and don't want to use Perl's
> Encode module because I'm stubborn and think Emacs should handle it :)

Maybe `file-coding-system-alist' or `coding-system-for-read' are of help?
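
As a rough sketch of the `coding-system-for-read' route, assuming the
whole file really is in one encoding: the helper below reads a file
into a temporary buffer with a forced coding system. The name
`my-read-file-as' is hypothetical, not part of Emacs.

```elisp
;; Sketch only: `my-read-file-as' is a hypothetical helper, not an
;; Emacs built-in.
(defun my-read-file-as (file coding)
  "Return the contents of FILE as a string, decoded with CODING."
  (with-temp-buffer
    ;; Binding `coding-system-for-read' overrides coding-system
    ;; detection, but only for reads performed inside this `let'.
    (let ((coding-system-for-read coding))
      (insert-file-contents file))
    (buffer-string)))

;; Usage:
;; (my-read-file-as "myfile.txt" 'cp1251)
```

Binding the variable with `let' rather than setting it globally keeps
the override from leaking into later file reads.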


regards,
Nikolaj Schumacher





* Re: decode-coding-string question
  2008-08-18 13:58       ` Ted Zlatanov
  2008-08-18 16:09         ` Nikolaj Schumacher
@ 2008-08-18 17:45         ` David Golden
  2008-08-18 19:11         ` Eli Zaretskii
       [not found]         ` <mailman.16993.1219086714.18990.help-gnu-emacs@gnu.org>
  3 siblings, 0 replies; 15+ messages in thread
From: David Golden @ 2008-08-18 17:45 UTC (permalink / raw)
  To: help-gnu-emacs

Ted Zlatanov wrote:

> Context: I have a file full of CP1251 data and don't want to use
> Perl's Encode module because I'm stubborn and think Emacs should
> handle it :)

Just in case: if you have a file full of cp1251, and you know it's
cp1251, it's usually best to just open it as cp1251 in the first
place!

C-x RET c cp1251 C-x C-f myfile.txt

It's typically only if you've got a file full of fragments in 
different encodings (horrible mail spool formats and the like) that
you want to decode and reencode particular subregions of whole files.

If you've already opened a file and its encoding is misdetected,
you can also hit 
C-x RET r cp1251 
to "revert" the buffer to the file reopened in the specified encoding.


* Re: decode-coding-string question
  2008-08-18 13:58       ` Ted Zlatanov
  2008-08-18 16:09         ` Nikolaj Schumacher
  2008-08-18 17:45         ` David Golden
@ 2008-08-18 19:11         ` Eli Zaretskii
  2008-08-19  8:34           ` Kevin Rodgers
       [not found]         ` <mailman.16993.1219086714.18990.help-gnu-emacs@gnu.org>
  3 siblings, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2008-08-18 19:11 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Ted Zlatanov <tzz@lifelogs.com>
> Date: Mon, 18 Aug 2008 08:58:55 -0500
> 
> How would you recommend decoding text from particular encodings?  Given
> text like the one shown above in a buffer, only decode-coding-region
> seems to DTRT, and it's not interactive.

If you mean interactively, i.e. you visited a buffer and then
discovered that it was decoded incorrectly, and the actual encoding is
different, then "C-x RET c cp1251 RET M-x revert-buffer RET" should do
what you want, I think.

> Context: I have a file full of CP1251 data and don't want to use Perl's
> Encode module because I'm stubborn and think Emacs should handle it :)

What about the rest of the file?  Is it encoded in some other encoding?
If not, then the above recipe should do.  If it doesn't, please give
more details.

> On Fri, 15 Aug 2008 20:06:59 +0400 Dmitry Dzhus <dima@sphinx.net.ru> wrote: 
> 
> DD> That was nitpicking somewhat irrelevant to unibyte-multibyte problem:
> DD> Ted expected to get «нуль», and it's «íóëü», though the string he
> DD> originally provided — «íîëü» — decodes to «ноль»; however, both «нуль»
> DD> and «ноль» mean «zero».
> 
> Yes, I was translating from Russian and knew the text said "zero" but
> didn't remember the correct spelling.

AFAIK, both spellings are right.






* Re: decode-coding-string question
       [not found]         ` <mailman.16993.1219086714.18990.help-gnu-emacs@gnu.org>
@ 2008-08-18 20:07           ` Ted Zlatanov
  2008-08-18 23:01             ` David Golden
  0 siblings, 1 reply; 15+ messages in thread
From: Ted Zlatanov @ 2008-08-18 20:07 UTC (permalink / raw)
  To: help-gnu-emacs

On Mon, 18 Aug 2008 22:11:37 +0300 Eli Zaretskii <eliz@gnu.org> wrote: 

>> From: Ted Zlatanov <tzz@lifelogs.com>
>> Date: Mon, 18 Aug 2008 08:58:55 -0500
>> 
>> How would you recommend decoding text from particular encodings?  Given
>> text like the one shown above in a buffer, only decode-coding-region
>> seems to DTRT, and it's not interactive.

EZ> If you mean interactively, i.e. you visited a buffer and then
EZ> discovered that it was decoded incorrectly, and the actual encoding is
EZ> different, then "C-x RET c cp1251 RET M-x revert-buffer RET" should do
EZ> what you want, I think.

>> Context: I have a file full of CP1251 data and don't want to use Perl's
>> Encode module because I'm stubborn and think Emacs should handle it :)

EZ> What about the rest of the file? is it encoded in some other encoding?
EZ> If not, then the above recipe should do.  If it doesn't, please tell
EZ> more details.

I often have to open mangled files with mixed-up encodings; it's
convenient to set the coding-system after I look at the text, and only
apply it to a region.

On Mon, 18 Aug 2008 18:45:29 +0100 David Golden <david.golden@oceanfree.net> wrote: 
[similar advice]

I see it's an uncommon operation, so I'll use the string-make-unibyte
recipe that Dmitry suggested and Eli confirmed.

Thanks
Ted


* Re: decode-coding-string question
  2008-08-18 20:07           ` Ted Zlatanov
@ 2008-08-18 23:01             ` David Golden
  2008-08-19 13:48               ` Ted Zlatanov
  0 siblings, 1 reply; 15+ messages in thread
From: David Golden @ 2008-08-18 23:01 UTC (permalink / raw)
  To: help-gnu-emacs

Ted Zlatanov wrote:

 
> I often have to open mangled files with mixed-up encodings; it's
> convenient to set the coding-system after I look at the text, and only
> apply it to a region.
> 

Uhm.

M-x recode-region 

also exists (at least in CVS emacs, no idea when it was introduced).
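
For use from Lisp, a small sketch of a wrapper around `recode-region'.
It assumes the argument order START END NEW-CODING CODING, where
NEW-CODING is what the text really was and CODING is how it was
misread, and it hard-codes the cp1251/iso-8859-1 pair from this thread.
The command name `my-recode-region-as-cp1251' is hypothetical.

```elisp
;; Sketch only: `my-recode-region-as-cp1251' is a hypothetical command,
;; not an Emacs built-in.
(defun my-recode-region-as-cp1251 (start end)
  "Re-decode the text between START and END as cp1251.
Assumes that text was wrongly decoded as iso-8859-1."
  (interactive "r")
  ;; Only the region is touched, so the rest of a mixed-encoding
  ;; buffer stays as it is.
  (recode-region start end 'cp1251 'iso-8859-1))
```

With a region selected, M-x my-recode-region-as-cp1251 should fix just
that fragment.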






* Re: decode-coding-string question
  2008-08-18 19:11         ` Eli Zaretskii
@ 2008-08-19  8:34           ` Kevin Rodgers
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Rodgers @ 2008-08-19  8:34 UTC (permalink / raw)
  To: help-gnu-emacs

Eli Zaretskii wrote:
>> From: Ted Zlatanov <tzz@lifelogs.com>
>> Date: Mon, 18 Aug 2008 08:58:55 -0500
>>
>> How would you recommend decoding text from particular encodings?  Given
>> text like the one shown above in a buffer, only decode-coding-region
>> seems to DTRT, and it's not interactive.
> 
> If you mean interactively, i.e. you visited a buffer and then
> discovered that it was decoded incorrectly, and the actual encoding is
> different, then "C-x RET c cp1251 RET M-x revert-buffer RET" should do
> what you want, I think.

aka `C-x RET r cp1251 RET', right?

-- 
Kevin Rodgers
Denver, Colorado, USA






* Re: decode-coding-string question
  2008-08-18 23:01             ` David Golden
@ 2008-08-19 13:48               ` Ted Zlatanov
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Zlatanov @ 2008-08-19 13:48 UTC (permalink / raw)
  To: help-gnu-emacs

On Tue, 19 Aug 2008 00:01:07 +0100 David Golden <david.golden@oceanfree.net> wrote: 

DG> Ted Zlatanov wrote:
>> I often have to open mangled files with mixed-up encodings; it's
>> convenient to set the coding-system after I look at the text, and only
>> apply it to a region.
>> 

DG> M-x recode-region 

DG> also exists (at least in CVS emacs, no idea when it was introduced).

That's exactly what I needed (and it's nicely interactive too).  I did
apropos for 'decod' and didn't see 'recode-*'.

Thanks
Ted

