Turning HTML character references into something readable?

unofficial mirror of help-gnu-emacs@gnu.org

* Turning HTML character references into something readable?
@ 2003-04-27  5:20 Karl Eichwalder
  2003-04-27 15:54 ` Benjamin Riefenstahl
  2003-04-28 12:17 ` Colin Marquardt
  0 siblings, 2 replies; 10+ messages in thread
From: Karl Eichwalder @ 2003-04-27  5:20 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain; charset=iso-2022-jp-2, Size: 453 bytes --]

Is there already a function to turn these HTML character references into
proper characters?  "&#1071;" must come out as the Cyrillic "Я" etc.

On the command line recode can do the trick:

    echo "&#1071;" | recode html..utf-8

-- 
                                                         |      ,__o
http://www.gnu.franken.de/ke/                            |    _-\_<,
ke@suse.de (work) / keichwa@gmx.net (home)               |   (*)/'(*)


* Re: Turning HTML character references into something readable?
  2003-04-27  5:20 Turning HTML character references into something readable? Karl Eichwalder
@ 2003-04-27 15:54 ` Benjamin Riefenstahl
  2003-04-27 19:09   ` Karl Eichwalder
  2003-04-28 12:17 ` Colin Marquardt
  1 sibling, 1 reply; 10+ messages in thread
From: Benjamin Riefenstahl @ 2003-04-27 15:54 UTC (permalink / raw)

Hi Karl,

Karl Eichwalder <keichwa@gmx.net> writes:
> Is there already a function to turn these HTML character references
> into proper characters?  "&#1071;" must come out as the Cyrillic
> "Я" etc.

Actually that literal seems to be in some JIS encoding on my side,
while &#1071; indicates Unicode.

If you want an ELisp function, M-x apropos RET char RET turns up
functions that can be used to convert a Unicode codepoint to a string
that Emacs shows fine here as cyrillic, e.g.:

  (char-to-string (decode-char 'ucs 1071))

If you want to get this into an interactive command, you'd need some
more coding.  Or maybe PSGML or some other SGML/HTML/XML mode may have
that functionality already.
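
One way to turn the hint above into an interactive command is sketched
here.  It assumes that only decimal references of the form &#NNNN; need
handling; the name my-decode-numeric-char-refs is made up for
illustration and is not part of Emacs.  Named references such as &auml;
and hexadecimal ones such as &#x42F; are left untouched.

```elisp
;; Sketch only: `my-decode-numeric-char-refs' is a hypothetical name,
;; not a function shipped with Emacs.
(defun my-decode-numeric-char-refs (start end)
  "Replace decimal character references between START and END.
A reference such as &#1071; becomes the character it names."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region start end)
      (goto-char (point-min))
      ;; Up to seven digits is enough for every Unicode code point.
      (while (re-search-forward "&#\\([0-9]\\{1,7\\}\\);" nil t)
        (let ((char (decode-char 'ucs (string-to-number (match-string 1)))))
          ;; Leave the reference alone if Emacs cannot map the number.
          (when char
            (replace-match (char-to-string char) t t)))))))
```

With a region selected, M-x my-decode-numeric-char-refs should turn
"Danke schön &#1070;&#1071;" into "Danke schön ЮЯ"; characters that are
already literal, such as the umlaut, are not touched.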

> On the command line recode can do the trick:
> 
>    echo "&#1071;" | recode html..utf-8

You can use shell-command-on-region (M-|) to run "recode html..utf-8"
directly.  Note that M-| is not recognized by some German keyboard
drivers.  But then for everyday use you may want to encapsulate that
in your own command, where the "recode" command is hardcoded anyway.

so long, benny


* Re: Turning HTML character references into something readable?
  2003-04-27 15:54 ` Benjamin Riefenstahl
@ 2003-04-27 19:09   ` Karl Eichwalder
  2003-04-28 13:02     ` Reiner Steib
  2003-04-28 17:12     ` Benjamin Riefenstahl
  0 siblings, 2 replies; 10+ messages in thread
From: Karl Eichwalder @ 2003-04-27 19:09 UTC (permalink / raw)


Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:

> Actually that literal seems to be in some JIS encoding on my side,
> while &#1071; indicates Unicode.

Gnus decided to turn it into JIS; initially it was Unicode/UTF-8.

>   (char-to-string (decode-char 'ucs 1071))

Yes, this is a good hint!

> If you want to get this into an interactive command, you'd need some
> more coding.  Or maybe PSGML or some other SGML/HTML/XML mode may have
> that functionality already.

In this case I cannot use PSGML because de.wikipedia.org is based on a
free style markup language...

>> On the command line recode can do the trick:
>> 
>>    echo "&#1071;" | recode html..utf-8
>
> You can use shell-command-on-region (M-|) to use "recode html..utf-8"
> directly.

I completely forgot about this possibility.  But now it turns out that
"recode html..utf-8" is too ambitious; if the file already contains
umlaut characters, they will be encoded twice:

    echo "Danke schön &#1070;&#1071;" | recode html..utf-8
    Danke schÃ¶n ��

I must find a way to tell recode to leave "Danke schön" untouched.

-- 
                                                         |      ,__o
http://www.gnu.franken.de/ke/                            |    _-\_<,
ke@suse.de (work) / keichwa@gmx.net (home)               |   (*)/'(*)


* Re: Turning HTML character references into something readable?
  2003-04-27  5:20 Turning HTML character references into something readable? Karl Eichwalder
  2003-04-27 15:54 ` Benjamin Riefenstahl
@ 2003-04-28 12:17 ` Colin Marquardt
  1 sibling, 0 replies; 10+ messages in thread
From: Colin Marquardt @ 2003-04-28 12:17 UTC (permalink / raw)


Karl Eichwalder <keichwa@gmx.net> writes:

> Is there already a function to turn these HTML character references into
> proper characters?  "&#1071;" must come out as the Cyrillic "Я" etc.

Does x-symbol help here?: http://x-symbol.sourceforge.net/

Cheers,
  Colin


* Re: Turning HTML character references into something readable?
  2003-04-27 19:09   ` Karl Eichwalder
@ 2003-04-28 13:02     ` Reiner Steib
  2003-04-28 15:20       ` Kai Großjohann
  2003-04-28 17:12     ` Benjamin Riefenstahl
  1 sibling, 1 reply; 10+ messages in thread
From: Reiner Steib @ 2003-04-28 13:02 UTC (permalink / raw)


On Sun, Apr 27 2003, Karl Eichwalder wrote:

> Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:
>
>> Actually that literal seems to be in some JIS encoding on my side,

Same here:

,----[ `C-u C-x =' ]
|   character: [ removed "mirrored `R'" ] (0151701, 54209, 0xd3c1)
|     charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
|  code point: 39 65
|      syntax: word
|    category: Y:Cyrillic characters of 2-byte character sets   j:Japanese  
| 	     |:While filling, we can break a line at this character.  
| buffer code: 0x92 0xA7 0xC1
|   file code: ESC 24 42 27 41 (encoded by coding system iso-2022-jp-2)
|        font: -Misc-Fixed-Medium-R-Normal--14-130-75-75-C-140-JISX0208.1983-0
`----

What does `C-u C-x =' say on that character before sending?

>> while &#1071; indicates Unicode.
>
> Gnus decided to turn it into JIS; initially it was Unicode/UTF-8.

I don't think that Gnus is able to convert UTF-8 to JIS.  Running
`find-coding-systems-region' in your message shows that Emacs 21.3
doesn't list any UTF coding-system.  This is basically what Gnus does
in the function `mm-find-mime-charset-region' in `mm-util.el'.

>>   (char-to-string (decode-char 'ucs 1071))

When I insert this char into the buffer...

  (insert (char-to-string (decode-char 'ucs 1071))); Я

... and use...

  (setq mm-coding-system-priorities nil) ;; default

... I get iso-8859-5.

With my setting of...

  (setq mm-coding-system-priorities '(iso-latin-1 iso-latin-9 mule-utf-8))

... I get utf-8.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo--- PGP key available via WWW   http://rsteib.home.pages.de/


* Re: Turning HTML character references into something readable?
  2003-04-28 13:02     ` Reiner Steib
@ 2003-04-28 15:20       ` Kai Großjohann
  2003-04-28 18:01         ` Reiner Steib
  0 siblings, 1 reply; 10+ messages in thread
From: Kai Großjohann @ 2003-04-28 15:20 UTC (permalink / raw)


Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:

> I don't think that Gnus is able to convert UTF-8 to JIS.

Does Emacs 21.3 include utf-translate-cjk already?  That might be
able to do it.
-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)


* Re: Turning HTML character references into something readable?
  2003-04-27 19:09   ` Karl Eichwalder
  2003-04-28 13:02     ` Reiner Steib
@ 2003-04-28 17:12     ` Benjamin Riefenstahl
  2003-04-29  3:55       ` Karl Eichwalder
  1 sibling, 1 reply; 10+ messages in thread
From: Benjamin Riefenstahl @ 2003-04-28 17:12 UTC (permalink / raw)


Hi Karl,


Karl Eichwalder <keichwa@gmx.net> writes:
>     echo "Danke schön &#1070;&#1071;" | recode html..utf-8
>     Danke schÃ¶n ��
> 
> I must find a way to tell recode to leave "Danke schön" untouched.

The recode tool probably assumes that HTML text is in iso-8859-1
encoding, while your file seems to be UTF-8 already.  Or the text is
passed to the tool as UTF-8 by Emacs because of the
process-coding-system used (see process-coding-system-alist and its
docs).
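
The double encoding described here can be reproduced inside Emacs,
which may help when checking this explanation.  The sketch below
assumes a current Emacs, where strings are Unicode internally; it
encodes the word as UTF-8 and then deliberately decodes the resulting
bytes as ISO-8859-1.

```elisp
;; Encode "schön" as UTF-8 bytes, then misread those bytes as
;; ISO-8859-1.  The two bytes of "ö" (0xC3 0xB6) come back as the two
;; Latin-1 characters "Ã" and "¶".
(decode-coding-string (encode-coding-string "schön" 'utf-8)
                      'iso-8859-1)
;; evaluates to "schÃ¶n"
```

That result matches the damaged "schÃ¶n" in the recode output quoted
above, which is consistent with UTF-8 input being treated as
ISO-8859-1 somewhere along the way.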


so long, benny


* Re: Turning HTML character references into something readable?
  2003-04-28 15:20       ` Kai Großjohann
@ 2003-04-28 18:01         ` Reiner Steib
  2003-04-29 13:53           ` Kai Großjohann
  0 siblings, 1 reply; 10+ messages in thread
From: Reiner Steib @ 2003-04-28 18:01 UTC (permalink / raw)


On Mon, Apr 28 2003, Kai Großjohann wrote:

> Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:
>
>> I don't think that Gnus is able to convert UTF-8 to JIS.
>
> Does Emacs 21.3 include utf-translate-cjk already?  

No.

> That might be able to do it.

Not (AFAICS) in CVS HEAD (from half an hour ago):

- (setq utf-translate-cjk t)

- C&P the char from <news:shr87oa53h.fsf@tux.gnu.franken.de> into a
  buffer.

- Save buffer, answer utf-8 on prompt for coding.

- Open the file (with utf-8 coding), I see:

,----[ M-x describe-char RET ]
|   character: [a rectangle] (01175275, 326333, 0x4fabd)
|     charset: mule-unicode-e000-ffff
| 	     (Unicode characters of the range U+E000..U+FFFF.)
|  code point: 117 61
|      syntax: w 	which means: word
|    category:
| buffer code: 0x9C 0xF3 0xF5 0xBD
|   file code: 0xEF 0xBF 0xBD (encoded by coding system mule-utf-8-unix)
|     Unicode: FFFD
|        font: -Misc-Fixed-Medium-R-Normal--14-130-75-75-C-70-ISO10646-1
`----

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo--- PGP key available via WWW   http://rsteib.home.pages.de/


* Re: Turning HTML character references into something readable?
  2003-04-28 17:12     ` Benjamin Riefenstahl
@ 2003-04-29  3:55       ` Karl Eichwalder
  0 siblings, 0 replies; 10+ messages in thread
From: Karl Eichwalder @ 2003-04-29  3:55 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain; charset=iso-2022-jp-2, Size: 666 bytes --]

Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:

> The recode tool probably assumes that HTML text is in iso-8859-1
> encoding, while your file seems to be UTF-8 already.

Probably.  Reading the recode manual I came up with this command line
solution:

    echo "Danke schön &#1070;&#1071;" \
      | recode -d ..html | recode html..utf-8
    =>
    Danke schön ЮЯ

Writing a proper Emacs command will be the next step :)
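
A minimal sketch of such a command, wrapping the pipeline above with
shell-command-on-region, might look like this.  The name
my-recode-html-region is hypothetical, and binding both coding systems
to utf-8 is an assumption: the pipeline's output is UTF-8 by
construction, but "recode -d ..html" reads its input in recode's
default charset, which depends on the local setup.

```elisp
;; Sketch only: `my-recode-html-region' is a hypothetical name.
;; Requires the external "recode" program on the PATH.
(defun my-recode-html-region (start end)
  "Filter the text between START and END through recode.
Literal characters are first protected as HTML entities, then all
entities and character references are converted to UTF-8."
  (interactive "r")
  ;; Assumption: the text is exchanged with the pipeline as UTF-8.
  (let ((coding-system-for-write 'utf-8)
        (coding-system-for-read 'utf-8))
    ;; OUTPUT-BUFFER nil and REPLACE t: the output replaces the region.
    (shell-command-on-region start end
                             "recode -d ..html | recode html..utf-8"
                             nil t)))
```

If recode's default charset is not UTF-8 on a given system, the first
stage would need an explicit source charset in place of the bare
"..html".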

-- 
                                                         |      ,__o
http://www.gnu.franken.de/ke/                            |    _-\_<,
ke@suse.de (work) / keichwa@gmx.net (home)               |   (*)/'(*)


* Re: Turning HTML character references into something readable?
  2003-04-28 18:01         ` Reiner Steib
@ 2003-04-29 13:53           ` Kai Großjohann
  0 siblings, 0 replies; 10+ messages in thread
From: Kai Großjohann @ 2003-04-29 13:53 UTC (permalink / raw)


Reiner Steib <4.uce.03.r.s@nurfuerspam.de> writes:

> Not (AFAICS) in CVS HEAD (from half an hour ago):
>
> - (setq utf-translate-cjk t)

You always had to use customize to set the variable.

But the variable no longer exists; instead we have the minor
mode utf-translate-cjk-mode.

-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

