recoding a buffer coding system

unofficial mirror of help-gnu-emacs@gnu.org

* recoding a buffer coding system
@ 2009-08-14 21:31 Santiago Mejia
  2009-08-15  6:36 ` Eli Zaretskii
  2009-08-15  8:26 ` Peter Dyballa
  0 siblings, 2 replies; 13+ messages in thread
From: Santiago Mejia @ 2009-08-14 21:31 UTC (permalink / raw)
  To: help-gnu-emacs

I am trying to make a script that downloads a webpage and reformats it
into enriched-mode.

I have managed to successfully reformat the raw html tags into
enriched-mode tags.  However, when I try to display the buffer in
enriched-mode (by using (format-decode-buffer 'text/enriched)), some of
the non-ASCII characters get screwed up (German umlauts, to be precise).  

I have managed to solve the issue through a nasty trick: saving the
file, killing the buffer, and reopening the file.  But this is a trick
that I would like to avoid.  

Any ideas as to how to do it?

Santiago.

* Re: recoding a buffer coding system
  2009-08-14 21:31 recoding a buffer coding system Santiago Mejia
@ 2009-08-15  6:36 ` Eli Zaretskii
  2009-08-15 14:31   ` Santiago Mejia
  2009-08-15  8:26 ` Peter Dyballa
  1 sibling, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2009-08-15  6:36 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Santiago Mejia <mejia@uchicago.edu>
> Date: Fri, 14 Aug 2009 16:31:46 -0500
> 
> 
> I have managed to successfully reformat the raw html tags into
> enriched-mode tags.  However, when I try to display the buffer in
> enriched-mode (by using (format-decode-buffer 'text/enriched)), some of
> the non-ASCII characters get screwed up (German umlauts, to be precise).  

Get screwed how, exactly?

> I have managed to solve the issue through a nasty trick: saving the
> file, killing the buffer, and reopening the file.  But this is a trick
> that I would like to avoid.  

You didn't say how reopening the file helps you avoid the problem.
Knowing that might suggest a way to avoid the trick.

Also, can you post a minimal file and some minimal code to reproduce
the problem?  There could be a bug somewhere.

Finally, what version of Emacs is that?




* Re: recoding a buffer coding system
  2009-08-14 21:31 recoding a buffer coding system Santiago Mejia
  2009-08-15  6:36 ` Eli Zaretskii
@ 2009-08-15  8:26 ` Peter Dyballa
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Dyballa @ 2009-08-15  8:26 UTC (permalink / raw)
  To: Santiago Mejia; +Cc: help-gnu-emacs

On 14.08.2009 at 23:31, Santiago Mejia wrote:

> I have managed to successfully reformat the raw html tags into
> enriched-mode tags.  However, when I try to display the buffer in
> enriched-mode (by using (format-decode-buffer 'text/enriched)),  
> some of
> the non-ASCII characters get screwed up (German umlauts, to be  
> precise).

How? What do they look like/how are they presented to you? Have you  
looked at the mode-line and which encoding it displays for the  
buffers? Does the downloaded file have a header in which an encoding  
is listed?

Your eMail user agent runs in GNU Emacs 22.1 – are you using it also  
for the purpose of downloading and reformatting HTML? GNU Emacs 23.1,  
the Unicode Emacs, might give better results...

--
Greetings

   Pete

By filing this bug report you have challenged the honor of my family.  
Prepare to die!

* Re: recoding a buffer coding system
  2009-08-15  6:36 ` Eli Zaretskii
@ 2009-08-15 14:31   ` Santiago Mejia
  2009-08-15 15:15     ` Peter Dyballa
  2009-08-15 15:24     ` Eli Zaretskii
  0 siblings, 2 replies; 13+ messages in thread
From: Santiago Mejia @ 2009-08-15 14:31 UTC (permalink / raw)
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

> Also, can you post a minimal file and some minimal code to reproduce
> the problem?  There could be a bug somewhere.

Sorry for the lack of detail.

The method I am using to download the page is:

(switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))

In the buffer *http www:wordreference.com:80* I see the character that
firefox displays as "ü" (u with umlaut) as \303\274.  When I try to copy
and paste it here in this e-mail, however, it appears as: "Ã¼" (that is
also what happens when I return this buffer's contents with
(buffer-string) and insert the returned string into another buffer).

As I said, however, if I merely save and reopen the file, the characters
get shown properly.

In case this is useful, in the buffer *http www:wordreference.com:80*
the variable 'buffer-file-coding-system' is mule-utf-8

And yes: I am using Emacs 22.1.1.

Santiago.

* Re: recoding a buffer coding system
  2009-08-15 14:31   ` Santiago Mejia
@ 2009-08-15 15:15     ` Peter Dyballa
  2009-08-16  2:29       ` Santiago Mejia
  2009-08-15 15:24     ` Eli Zaretskii
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Dyballa @ 2009-08-15 15:15 UTC (permalink / raw)
  To: Santiago Mejia; +Cc: help-gnu-emacs

On 15.08.2009 at 16:31, Santiago Mejia wrote:

> In the buffer *http www:wordreference.com:80* I see the character that
> firefox displays as "ü" (u with umlaut) as \303\274.

LATIN SMALL LETTER U WITH DIAERESIS is U+00FC. In UTF-8 it is saved as
C3 BC (hex) or \303 \274 (octal). So you get a correct byte representation.

>   When I try to copy
> and paste it here in this e-mail, however, it appears as: "Ã¼"

Because LATIN CAPITAL LETTER A WITH TILDE is U+00C3 and VULGAR
FRACTION ONE QUARTER is U+00BC, and these two bytes are presented as
if they belonged to some ISO Latin encoding.
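
The two readings of the same byte pair can be sketched in Emacs Lisp.
This is only an illustration, assuming an Emacs with Unicode-based
internals (23 or later); the function name my-umlaut-readings is made
up for this example.

```elisp
(defun my-umlaut-readings ()
  "Return the UTF-8 bytes of U+00FC and two decodings of those bytes.
Hypothetical helper, for illustration only."
  (let ((bytes (encode-coding-string (string #xFC) 'utf-8)))
    (list (append bytes nil)                     ; the raw byte values
          (decode-coding-string bytes 'latin-1)  ; each byte read as one Latin-1 character
          (decode-coding-string bytes 'utf-8)))) ; both bytes read as one character
```

Evaluating (my-umlaut-readings) should give the bytes 195 and 188 (hex
C3 BC, octal 303 274), then the two-character string "Ã¼", then the
single character "ü".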

>
> As I said, however, if I merely save and reopen the file, the  
> characters
> get shown properly.

Yes, GNU Emacs now interprets the two bytes as one Unicode character.

>
> In case this is useful, in the buffer *http www:wordreference.com:80*
> the variable 'buffer-file-coding-system' is mule-utf-8
>

In the end? When you re-open a second time?

The problem probably is that url-retrieve-synchronously fetches a
byte stream which is fed into a buffer with a 7-bit (?) encoding, so
Unicode encoded characters end up as two (or more) bytes which are
displayed in octal because their character codes are inappropriate for
this encoding.

Working in GNU Emacs 23.1.50 and 22.3, I see no octal codes; I only
see the bytes of the UTF-8 encoded umlauts etc., in accordance with the
HTML property "charset=utf-8". The buffer actually has no encoding at
all, so you're lucky that its contents are saved as UTF-8! Therefore
no information is lost, and GNU Emacs evidently uses the proper
encoding when it opens the *file*.
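
A minimal sketch of how the "charset=..." declaration could be read out
of such a buffer and mapped to a coding system. The function name
my-guess-charset-in-buffer and the simple regexp are assumptions made
for this illustration; real pages may declare their charset in other
ways.

```elisp
(defun my-guess-charset-in-buffer ()
  "Return the coding system named by the first charset=... in the buffer.
Return nil if none is found or Emacs does not know the name.
Hypothetical helper, for illustration only."
  (save-excursion
    (goto-char (point-min))
    (when (re-search-forward "charset=[\"']?\\([-A-Za-z0-9_]+\\)" nil t)
      (let ((cs (intern (downcase (match-string 1)))))
        ;; Only return names that Emacs accepts as coding systems.
        (and (coding-system-p cs) cs)))))
```

Called in a buffer containing "Content-Type: text/html; charset=UTF-8",
this should return the symbol utf-8.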

Maybe using

	(modify-coding-system-alist 'process "<some thing>"   'utf-8)

makes GNU Emacs handle the buffer, associated with no file and with  
no process, more like it should...  I haven't found the proper setting!

--
Greetings

   Pete

Time is an illusion. Lunchtime, doubly so.

* Re: recoding a buffer coding system
  2009-08-15 14:31   ` Santiago Mejia
  2009-08-15 15:15     ` Peter Dyballa
@ 2009-08-15 15:24     ` Eli Zaretskii
  2009-08-16  2:33       ` Santiago Mejia
  1 sibling, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2009-08-15 15:24 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Santiago Mejia <mejia@uchicago.edu>
> Date: Sat, 15 Aug 2009 09:31:40 -0500
> 
> (switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))
> 
> In the buffer *http www:wordreference.com:80* I see the character that
> firefox displays as "ü" (u with umlaut) as \303\274.

\303\274 is the UTF-8 representation of ü.  I'm guessing that the
buffer where it is displayed as \303\274 is a unibyte buffer.

> As I said, however, if I merely save and reopen the file, the characters
> get shown properly.

Does it help to say "M-: (set-buffer-multibyte t) RET"?
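
If the retrieval buffer really is unibyte, another thing to try is to
decode its bytes explicitly. A minimal sketch, assuming the content is
UTF-8; my-buffer-text-decoded and my-decode-demo are hypothetical names,
and the temporary buffer merely stands in for the buffer returned by
url-retrieve-synchronously.

```elisp
(defun my-buffer-text-decoded (coding-system)
  "Return the current buffer's text decoded with CODING-SYSTEM.
Meant for unibyte buffers holding raw bytes."
  (decode-coding-string
   (buffer-substring-no-properties (point-min) (point-max))
   coding-system))

(defun my-decode-demo ()
  "Decode the bytes \\303\\274 from a simulated unibyte buffer."
  (with-temp-buffer
    (set-buffer-multibyte nil)        ; mimic a unibyte buffer
    (insert "gr\303\274n")            ; raw UTF-8 bytes for u with umlaut
    (my-buffer-text-decoded 'utf-8)))
```

(my-decode-demo) should return the four-character string "grün", which
can then be inserted into an ordinary multibyte buffer.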





* Re: recoding a buffer coding system
  2009-08-15 15:15     ` Peter Dyballa
@ 2009-08-16  2:29       ` Santiago Mejia
  2009-08-16  2:55         ` Peter Dyballa
  0 siblings, 1 reply; 13+ messages in thread
From: Santiago Mejia @ 2009-08-16  2:29 UTC (permalink / raw)
  To: help-gnu-emacs


>> In the buffer *http www:wordreference.com:80* I see the character that
>> firefox displays as "ü" (u with umlaut) as \303\274.
>>
>>   When I try to copy
>> and paste it here in this e-mail, however, it appears as: "Ã¼"
>>
>> As I said, however, if I merely save and reopen the file, the
>> characters get shown properly.

> Working in GNU Emacs 23.1.50 and 22.3, I see no octal codes; I only
> see the bytes of the UTF-8 encoded umlauts etc., in accordance with the
> HTML property "charset=utf-8". The buffer actually has no encoding at
> all, so you're lucky that its contents are saved as UTF-8! Therefore
> no information is lost, and GNU Emacs evidently uses the proper
> encoding when it opens the *file*.

This is strange.  I just installed emacs 23.0.60.1 (the emacs23 that
comes with Ubuntu --called emacs-snapshot) and I find the same exact
result: I still see the same \303\274 character for ü when I call:

(switch-to-buffer (url-retrieve-synchronously "http://www.wordreference.com/deen/grun"))

>> In case this is useful, in the buffer *http www:wordreference.com:80*
>> the variable 'buffer-file-coding-system' is mule-utf-8
>>
>
> In the end? When you re-open a second time?

No.  In the beginning, before saving (actually, I save and re-open the
file with a different name).  When I re-open the file,
buffer-file-coding-system is utf-8-unix.

> Maybe using
>
>       (modify-coding-system-alist 'process "<some thing>"   'utf-8)
>
> makes GNU Emacs handle the buffer, associated with no file and with
> no process, more like it should...  I haven't found the proper
> setting!
>

I will try your suggestion, but this will entail going through
the documentation and trying to understand it.  This weekend,
unfortunately, I will not have the time to do so.

Any further help is appreciated.

Santiago.





* Re: recoding a buffer coding system
  2009-08-15 15:24     ` Eli Zaretskii
@ 2009-08-16  2:33       ` Santiago Mejia
  0 siblings, 0 replies; 13+ messages in thread
From: Santiago Mejia @ 2009-08-16  2:33 UTC (permalink / raw)
  To: help-gnu-emacs


>> As I said, however, if I merely save and reopen the file, the characters
>> get shown properly.
>
> Does it help to say "M-: (set-buffer-multibyte t) RET"?

No.  Nothing happens when I call this function.

Any further ideas?






* Re: recoding a buffer coding system
  2009-08-16  2:29       ` Santiago Mejia
@ 2009-08-16  2:55         ` Peter Dyballa
  2009-08-16  3:17           ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Dyballa @ 2009-08-16  2:55 UTC (permalink / raw)
  To: Santiago Mejia; +Cc: help-gnu-emacs


On 16.08.2009 at 04:29, Santiago Mejia wrote:

> I just installed emacs 23.0.60.1 (the emacs23 that
> comes with Ubuntu --called emacs-snapshot) and I find the same exact
> result: I still see the same \303\274 character for ü when I call:


Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world  
of 7-bit US ASCII only...

--
With peaceful greetings

   Pete

Competition is the great eroder of profits.





* Re: recoding a buffer coding system
  2009-08-16  2:55         ` Peter Dyballa
@ 2009-08-16  3:17           ` Eli Zaretskii
  2009-08-16 13:49             ` Santiago Mejia
  2009-08-16 21:09             ` Peter Dyballa
  0 siblings, 2 replies; 13+ messages in thread
From: Eli Zaretskii @ 2009-08-16  3:17 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Peter Dyballa <Peter_Dyballa@Web.DE>
> Date: Sun, 16 Aug 2009 04:55:48 +0200
> Cc: help-gnu-emacs@gnu.org
> 
> 
> On 16.08.2009 at 04:29, Santiago Mejia wrote:
> 
> > I just installed emacs 23.0.60.1 (the emacs23 that
> > comes with Ubuntu --called emacs-snapshot) and I find the same exact
> > result: I still see the same \303\274 character for ü when I call:
> 
> 
> Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world  
> of 7-bit US ASCII only...

Sounds like a bug that should be reported.





* Re: recoding a buffer coding system
  2009-08-16  3:17           ` Eli Zaretskii
@ 2009-08-16 13:49             ` Santiago Mejia
  2009-08-16 17:06               ` Eli Zaretskii
  2009-08-16 21:09             ` Peter Dyballa
  1 sibling, 1 reply; 13+ messages in thread
From: Santiago Mejia @ 2009-08-16 13:49 UTC (permalink / raw)
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Peter Dyballa <Peter_Dyballa@Web.DE>
>> Date: Sun, 16 Aug 2009 04:55:48 +0200
>> Cc: help-gnu-emacs@gnu.org
>> 
>> 
>> On 16.08.2009 at 04:29, Santiago Mejia wrote:
>> 
>> > I just installed emacs 23.0.60.1 (the emacs23 that
>> > comes with Ubuntu --called emacs-snapshot) and I find the same exact
>> > result: I still see the same \303\274 character for ü when I call:
>> 
>> 
>> Me too, in GNU Emacs 23.1.50. Maybe the function comes from a world  
>> of 7-bit US ASCII only...
>
> Sounds like a bug that should be reported.

Probably there is a bug... however, there is something that emacs is
doing right in the process of writing and re-opening the file.

I tried debugging my program by going step by step through the
(write-file "foo") and (insert-file-contents "foo") functions, to see if
I could figure out where the conjuring trick was done.  However, I did
not quite find it (that is why I appealed to the list).

Any ideas as to what I should look for in debugging these functions?
(Perhaps which functions emacs is likely using, which I could
hack from emacs itself, so as not to have to save and re-open?)

Is my best bet to look at (write-file "foo") or at
(insert-file-contents "foo")?

S.

* Re: recoding a buffer coding system
  2009-08-16 13:49             ` Santiago Mejia
@ 2009-08-16 17:06               ` Eli Zaretskii
  0 siblings, 0 replies; 13+ messages in thread
From: Eli Zaretskii @ 2009-08-16 17:06 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Santiago Mejia <mejia@uchicago.edu>
> Date: Sun, 16 Aug 2009 08:49:25 -0500
> 
> I tried debugging my program by going step by step through the
> (write-file "foo") and (insert-file-contents "foo") functions, to see if
> I could figure out where the conjuring trick was done.  However, I did
> not quite find it (that is why I appealed to the list).

I'm quite sure it works because insert-file-contents decodes the UTF-8
sequences into Unicode characters.  That's not a bug, but the correct
behavior.
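
The save-and-reopen effect can be reproduced in a small sketch. The
function name my-file-round-trip is hypothetical, and forcing utf-8 on
reading is an assumption made to keep the example deterministic;
normally Emacs picks the coding system itself when visiting a file.

```elisp
(defun my-file-round-trip ()
  "Write the bytes \\303\\274 to a temporary file, then read them as UTF-8.
Hypothetical demo of the save-and-reopen trick."
  (let ((file (make-temp-file "umlaut-demo")))
    (unwind-protect
        (progn
          ;; Write the raw bytes to disk unchanged.
          (let ((coding-system-for-write 'binary))
            (write-region "gr\303\274n" nil file nil 'silent))
          ;; Reading the file decodes the byte pair into one character.
          (with-temp-buffer
            (let ((coding-system-for-read 'utf-8))
              (insert-file-contents file))
            (buffer-string)))
      (delete-file file))))
```

(my-file-round-trip) should return "grün": the decoding happens on the
reading side, not when the bytes are written.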

The bug seems to be in url-retrieve-synchronously, so you might as
well stop looking at write-file and insert-file-contents.




* Re: recoding a buffer coding system
  2009-08-16  3:17           ` Eli Zaretskii
  2009-08-16 13:49             ` Santiago Mejia
@ 2009-08-16 21:09             ` Peter Dyballa
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Dyballa @ 2009-08-16 21:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: help-gnu-emacs

On 16.08.2009 at 05:17, Eli Zaretskii wrote:

> Sounds like a bug that should be reported.

I think there is no bug in url-retrieve-synchronously! This function
needs to be kind of universal, i.e., it must not assume or set anything.
From the internet one can download anything: 7-bit US-ASCII, 8-bit
umlauts, Unicode text, and real binary data (PDF, JPEG, MPEG, ...). It
would be nice if this function accepted another argument, the
encoding for the buffer created. Right now the user has to take care
of this, because the user knows what kind of "data" will be (or
already was) downloaded. The variables save-buffer-coding-system and
buffer-file-coding-system determine how the buffer will be saved in a
file. And this should suffice...
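
A wrapper that takes the encoding as an extra argument could look
roughly like the sketch below. my-url-retrieve-decoded is a hypothetical
name, not part of the url library, and the sketch assumes that the
retrieval buffer holds the raw, undecoded bytes with the HTTP headers
first, separated from the body by a blank line.

```elisp
(require 'url)

(defun my-url-retrieve-decoded (url coding-system)
  "Fetch URL synchronously; return its body decoded with CODING-SYSTEM.
Hypothetical wrapper, not part of the url library."
  (let ((buffer (url-retrieve-synchronously url)))
    (unless buffer
      (error "Could not retrieve %s" url))
    (with-current-buffer buffer
      (unwind-protect
          (progn
            (goto-char (point-min))
            ;; The HTTP headers end at the first blank line.
            (re-search-forward "\r?\n\r?\n" nil 'move)
            (decode-coding-string
             (buffer-substring-no-properties (point) (point-max))
             coding-system))
        ;; Always discard the retrieval buffer afterwards.
        (kill-buffer buffer)))))
```

It would be called as, for example,
(my-url-retrieve-decoded "http://www.wordreference.com/deen/grun" 'utf-8),
with the caller supplying the encoding it knows the page to use.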

--
Greetings

   Pete

Theory and practice are the same, in theory, but, in practice, they  
are different.
