unofficial mirror of help-gnu-emacs@gnu.org
* making "&#29572;&#22872;" say "Xuanzang" in chinese
@ 2005-03-23  3:14 Joe Corneli
  2005-03-23 13:08 ` Mark Plaksin
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Corneli @ 2005-03-23  3:14 UTC


I'm getting text from the web, and part of what I'm getting are
strings of numbers that denote Chinese characters.  I know
Emacs can display Chinese characters, because I see them all the
time... but how can I translate these strings?

The actual page I'm looking at is...

http://en.wikipedia.org/wiki/Xuan_Zang_(fictional_character)


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
  2005-03-23  3:14 making "&#29572;&#22872;" say "Xuanzang" in chinese Joe Corneli
@ 2005-03-23 13:08 ` Mark Plaksin
  2005-03-23 15:56   ` Joe Corneli
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mark Plaksin @ 2005-03-23 13:08 UTC


Joe Corneli <jcorneli@math.utexas.edu> writes:

> I'm getting text from the web, and part of what I'm getting are
> strings of numbers that denote Chinese characters.  I know
> Emacs can display Chinese characters, because I see them all the
> time... but how can I translate these strings?
>
> The actual page I'm looking at is...
>
> http://en.wikipedia.org/wiki/Xuan_Zang_(fictional_character)

Hmm, it just works for me.  I'm using CVS Emacs (actually the multi-tty
branch), emacs-w3m, and Debian unstable with a zillion fonts installed.


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
  2005-03-23 13:08 ` Mark Plaksin
@ 2005-03-23 15:56   ` Joe Corneli
  2005-03-23 16:29   ` Joe Corneli
       [not found]   ` <mailman.20.1111596532.28103.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 7+ messages in thread
From: Joe Corneli @ 2005-03-23 15:56 UTC


   Joe Corneli <jcorneli@math.utexas.edu> writes:

   > I'm getting text from the web, and part of what I'm getting are
   > strings of numbers that denote Chinese characters.  I know
   > Emacs can display Chinese characters, because I see them all the
   > time... but how can I translate these strings?
   >
   > The actual page I'm looking at is...
   >
   > http://en.wikipedia.org/wiki/Xuan_Zang_(fictional_character)

   Hmm, it just works for me.  I'm using CVS Emacs (actually the multi-tty
   branch), emacs-w3m, and Debian unstable with a zillion fonts installed.

Presumably there is some code in emacs-w3m that does the translation -
I'll look for that.


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
  2005-03-23 13:08 ` Mark Plaksin
  2005-03-23 15:56   ` Joe Corneli
@ 2005-03-23 16:29   ` Joe Corneli
       [not found]   ` <mailman.20.1111596532.28103.help-gnu-emacs@gnu.org>
  2 siblings, 0 replies; 7+ messages in thread
From: Joe Corneli @ 2005-03-23 16:29 UTC


Adapted from w3m-filter.el:

(while (re-search-forward "&#\\([0-9]+\\);" nil t)
  (setq ucs (string-to-number (match-string 1)))
  (delete-region (match-beginning 0) (match-end 0))
  (insert-char ucs 1))

This would appear to work if the characters themselves were recognized...

But when I run this expression on a buffer containing the string
"&#29572;&#22872;" what I get is an error, like this:

Debugger entered--Lisp error: (error "Invalid character: 071604, 29572, 0x7384")
  insert-char(29572 1)

What do I have to do to make Emacs treat these things as characters?


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
       [not found]   ` <mailman.20.1111596532.28103.help-gnu-emacs@gnu.org>
@ 2005-03-24 20:07     ` Miles Bader
  2005-03-25  1:07       ` Joe Corneli
  0 siblings, 1 reply; 7+ messages in thread
From: Miles Bader @ 2005-03-24 20:07 UTC



Joe Corneli <jcorneli@math.utexas.edu> writes:
> Adapted from w3m-filter.el:
>
> (while (re-search-forward "&#\\([0-9]+\\);" nil t)
>   (setq ucs (string-to-number (match-string 1)))
>   (delete-region (match-beginning 0) (match-end 0))
>   (insert-char ucs 1))
>
> This would appear to work if the characters themselves were recognized...
>
> But when I run this expression on a buffer containing the string
> "&#29572;&#22872;" what I get is an error, like this:

Is that really what w3m does?  I'm not sure how the above could possibly
work in any normal version of Emacs -- the argument to `insert-char' is
an Emacs character, not a Unicode code point.  So, you need to
translate from the Unicode code point to the Emacs character encoding.

One method might be to translate the Unicode code point into a UTF-16
string (should be trivial I guess), and then use `decode-coding-string'
to translate that into Emacs' internal encoding; e.g.:


 (while (re-search-forward "&#\\([0-9]+\\);" nil t)
   (let* ((ucs (string-to-number (match-string 1)))
          (ucs-string (string (logand ucs #xFF) (logand (ash ucs -8) #xFF)))
          (decoded-string (decode-coding-string ucs-string 'mule-utf-16le)))
     (delete-region (match-beginning 0) (match-end 0))
     (insert decoded-string)))


For me, this does the right thing on your example, and on the text of
that wikipedia page:

   The fictional character Xuanzang (玄奘, WG: Hsüan-tsang), a central
   character of the classic Chinese novel Journey to the West ...


It probably will only work well in recent CVS versions of Emacs
that have `utf-translate-cjk-mode' turned on by default though. [*]
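
[Editorial sketch, not part of the original message.]  The byte-pair idea
above could be written as a small helper for later Emacs releases.  This
assumes an Emacs that provides `unibyte-string' and names the coding
system `utf-16le' rather than `mule-utf-16le'; the helper name
`my-ucs-to-string' is hypothetical, and the sketch covers only code
points that fit in 16 bits:

```elisp
;; Sketch: decode one 16-bit code point by way of a two-byte UTF-16LE string.
;; Assumes `unibyte-string' and the `utf-16le' coding system are available.
;; `my-ucs-to-string' is a hypothetical helper name.
(defun my-ucs-to-string (ucs)
  "Return a one-character string for the 16-bit code point UCS."
  (decode-coding-string
   (unibyte-string (logand ucs #xFF)            ; low byte first (little endian)
                   (logand (ash ucs -8) #xFF))  ; then the high byte
   'utf-16le))

;; 29572 is #x7384, the first character of the example in this thread.
(my-ucs-to-string 29572)
```

Building the bytes with `unibyte-string' keeps them as raw bytes, which is
what `decode-coding-string' expects as input.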

-Miles


[*] In the current CVS Emacs, there seems to be a function that does
    this translation directly too, `utf-lookup-subst-table-for-decode'
    but given the odd name, it's probably not intended for general
    use...

-- 
Love is a snowmobile racing across the tundra.  Suddenly it flips over,
pinning you underneath.  At night the ice weasels come.  --Nietzsche


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
  2005-03-24 20:07     ` Miles Bader
@ 2005-03-25  1:07       ` Joe Corneli
  2005-03-25  2:16         ` Miles Bader
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Corneli @ 2005-03-25  1:07 UTC



   Joe Corneli <jcorneli@math.utexas.edu> writes:
   > Adapted from w3m-filter.el:
   >
   > (while (re-search-forward "&#\\([0-9]+\\);" nil t)
   >   (setq ucs (string-to-number (match-string 1)))
   >   (delete-region (match-beginning 0) (match-end 0))
   >   (insert-char ucs 1))
   >
   > This would appear to work if the characters themselves were recognized...
   >
   > But when I run this expression on a buffer containing the string
   > "&#29572;&#22872;" what I get is an error, like this:

   Is that really what w3m does?

Hm... well I did doctor it up a bit.  In particular, I took out some
code that wrapped `ucs' in the last line with the function defined by:

 (defun w3m-ucs-to-char (codepoint)
   (or (decode-char 'ucs codepoint) ?~))

But keeping the function around wasn't helping either.  Except, when I
tried it again, it worked, so I must have gotten something wrong.

This code seems a little more readable than the code you
supplied...  but they seem to have the same effect.
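
[Editorial sketch, not part of the original message.]  The `decode-char'
approach with a visible fallback might look like this on its own.  The
name `my-codepoint-to-string' is hypothetical and is not part of
emacs-w3m; the "~" fallback simply mirrors the w3m helper quoted above:

```elisp
;; Sketch: map a numeric code point to a one-character string.
;; `my-codepoint-to-string' is a hypothetical name, not an emacs-w3m function.
(defun my-codepoint-to-string (codepoint)
  "Return CODEPOINT as a one-character string, or \"~\" if undecodable."
  ;; `decode-char' returns nil when it cannot map the code point,
  ;; in which case the tilde is used instead.
  (string (or (decode-char 'ucs codepoint) ?~)))

(my-codepoint-to-string 29572)   ; first character of the example, #x7384
(my-codepoint-to-string 22872)   ; second character, #x5958
```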

Anyway, your advice got me past whatever I was stumbling over.

Can you suggest something that will work on this content from the
gnu.org homepage?  Neither the w3m code nor your code seems to produce
human-readable output on this stuff (maybe I'm missing some fonts or
something?).  I get a bunch of control-at characters... (oh yeah,
after modifying the "[0-9]" to be ".....").

  [ Az@rbaycanca | Bahasa Indonesia | Bosanski | Catal`
  | &#x7b80;&#x4f53;&#x4e2d;&#x6587; |
  &#x7e41;&#x9ad4;&#x4e2d;&#x6587; | Cesky | Dansk |
  Deutsch | English | Ellynika' | Espaqol | Frangais
  | Hrvatski | Italiano | E+B+R+J+T+ |
  &#x65e5;&#x672c;&#x8a9e; | &#xd55c;&#xad6d;&#xc5b4; |
  Magyar | Nederlands | Norsk | Polski | Portugujs |
  Rombna | Russkij | Srpski | Shqip | Suomi |
  Svenska | Tagalog |
  &#x0e20;&#x0e32;&#x0e29;&#x0e32;&#x0e44;&#x0e17;&#x0e22; |
  T|rkge | Tie>'ng Vie>-.t | Ukrayins'ka ]


* Re: making "&#29572;&#22872;" say "Xuanzang" in chinese
  2005-03-25  1:07       ` Joe Corneli
@ 2005-03-25  2:16         ` Miles Bader
  0 siblings, 0 replies; 7+ messages in thread
From: Miles Bader @ 2005-03-25  2:16 UTC


Joe Corneli <jcorneli <at> math.utexas.edu> writes:
>  (defun w3m-ucs-to-char (codepoint)
>    (or (decode-char 'ucs codepoint) ?~))
> 
> But keeping the function around wasn't helping either.  Except, when I
> tried it again, it worked, so I must have gotten something wrong.
> 
> This code seems a little more readable than the code you
> supplied...  but they seem to have the same effect.

Hmmm, I missed that; yeah, `decode-char' does look much nicer ... :-)

> Can you suggest something that will work on this content from the
> gnu.org homepage?  Neither the w3m code nor your code seems to produce
> human-readable output on this stuff (maybe I'm missing some fonts or
> something?).  I get a bunch of control-at characters... (oh yeah,
> after modifying the "[0-9]" to be ".....").
> 
>   [ Az@rbaycanca | Bahasa Indonesia | Bosanski | Catal`
>   | &#x7b80;&#x4f53;&#x4e2d;&#x6587; |
>   &#x7e41;&#x9ad4;&#x4e2d;&#x6587; | Cesky | Dansk |

Presumably the "x" following &# means "hex", so you should use the BASE argument
to string-to-number if you see it.

The following tweak to your original code seems to generate reasonable output:

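  (while (re-search-forward "&#\\(x\\)?\\([0-9a-f]+\\);" nil t)
    ;; Group 1 matches only when an "x" is present, which selects base 16.
    (let ((ucs (string-to-number (match-string 2)
                                 (if (match-beginning 1) 16 10))))
      (delete-region (match-beginning 0) (match-end 0))
      ;; `decode-char' maps the Unicode code point to an Emacs character.
      (insert-char (decode-char 'ucs ucs) 1)))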

[The trick to select decimal or hex works because `match-beginning' returns nil
for optional parenthesized expressions which didn't match.]
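
[Editorial sketch, not part of the original message.]  The same loop could
be packaged as a function that works on a string instead of the current
buffer.  The name `my-decode-numeric-entities' is hypothetical.  Two
choices here go beyond the code above and are assumptions of this sketch:
decimal and hex digits are matched by separate regexp groups, and
references that cannot be decoded are left in place:

```elisp
;; Sketch: replace &#NNNN; (decimal) and &#xHHHH; (hex) references in a string.
;; `my-decode-numeric-entities' is a hypothetical helper name.
(defun my-decode-numeric-entities (text)
  "Return TEXT with numeric character references replaced by characters."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((case-fold-search t))         ; also accept "&#X" and digits A-F
      ;; Group 1 holds hex digits, group 2 holds decimal digits.
      (while (re-search-forward
              "&#\\(?:x\\([0-9a-f]+\\)\\|\\([0-9]+\\)\\);" nil t)
        (let* ((ucs (if (match-beginning 1)
                        (string-to-number (match-string 1) 16)
                      (string-to-number (match-string 2) 10)))
               ;; Skip values beyond the Unicode range instead of signalling.
               (char (and (<= ucs #x10FFFF) (decode-char 'ucs ucs))))
          (when char
            ;; FIXEDCASE and LITERAL are t so the text is inserted verbatim.
            (replace-match (string char) t t)))))
    (buffer-string)))

(my-decode-numeric-entities "&#29572;&#22872;")
(my-decode-numeric-entities "&#x7b80;&#x4f53;&#x4e2d;&#x6587;")
```

Working in a temporary buffer keeps the caller's buffer and point
untouched, at the cost of copying the text once.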

-Miles

