chinese word mode

unofficial mirror of emacs-devel@gnu.org 

* chinese word mode
@ 2013-11-05  9:11 Eric Abrahamsen
  2013-11-06  6:59 ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2013-11-05  9:11 UTC
  To: emacs-devel

So the follow-up to my earlier message is that I'm trying to create a
chinese-word-mode, which will behave (almost exactly) like the existing
thai-word-mode defined in lisp/language/thai-util.el and friends.

The idea is that an entire dictionary of words is provided in a nested
char table, and then a minor mode both remaps most word-related commands
to use that dictionary, and fill-find-break-point-function is rewired to
do the same. The Thai version looks like this:

(define-minor-mode thai-word-mode
  "Minor mode to make word-oriented commands aware of Thai words."
  :global t :group 'mule
  (cond (thai-word-mode
	 ;; This enables linebreak between Thai characters.
	 (modify-category-entry (make-char 'thai-tis620) ?|)
	 ;; This enables linebreak at a Thai word boundary.
	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
			       'thai-fill-find-break-point))
	(t
	 (modify-category-entry (make-char 'thai-tis620) ?| nil t)
	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
			       nil))))

I have shamelessly copied most of the code, and begun reworking it for
Chinese. But I'm confused about the charset specifications above.

Thai has only two charsets (one of which is thai-tis620), while Chinese
has more than a dozen (though I'm only messing with simplified Chinese
for now, so call it six or so).

My buffers are utf-8 encoded, and describe-char on a Chinese character
shows "preferred charset: unicode-bmp". So what do I put for the charset
in order to make these functions target the right characters? Chinese
characters all seem to have the "|" line-breakable category by default,
but (I think) I can only add the custom fill break point function one
charset at a time.

Thanks!
Eric


* Re: chinese word mode
  2013-11-05  9:11 chinese word mode Eric Abrahamsen
@ 2013-11-06  6:59 ` Eric Abrahamsen
  2013-11-06 13:36   ` Stefan Monnier
  2013-11-06 15:37   ` William Xu
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Abrahamsen @ 2013-11-06  6:59 UTC
  To: emacs-devel

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

[...]

> (define-minor-mode thai-word-mode
>   :global t :group 'mule
>   (cond (thai-word-mode
> 	 ;; This enables linebreak between Thai characters.
> 	 (modify-category-entry (make-char 'thai-tis620) ?|)
> 	 ;; This enables linebreak at a Thai word boundary.
> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
> 			       'thai-fill-find-break-point))
> 	(t
> 	 (modify-category-entry (make-char 'thai-tis620) ?| nil t)
> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
> 			       nil))))
>

[...]

> My buffers are utf-8 encoded, and describe-char on a Chinese character
> shows "preferred charset: unicode-bmp". So what do I put for the charset
> in order to make these functions target the right characters? Chinese
> characters all seem to have the "|" line-breakable category by default,
> but (I think) I can only add the custom fill break point function one
> charset at a time.

I've tried slapping the 'fill-find-break-point-function onto the
'unicode charset for now, and it works fine because the function only
does anything if point is in the midst of Chinese. It presumably gets
applied to all characters, though, and that can't be a real solution.

I'm guessing I'll need to separate simplified and traditional word sets
and make two versions of the mode. Both modes will loop through their
applicable charsets and apply/remove the custom break point function.
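
A minimal sketch of such a loop, under stated assumptions: the charset
list is only a guess at what "applicable" means for simplified Chinese,
and every `my-' name is hypothetical rather than taken from the library.
Whether filling actually consults these charsets depends on which
charset Emacs reports for the character at point.

```elisp
;; Hypothetical list; which charsets really apply is an assumption.
(defvar my-chinese-word-simplified-charsets '(chinese-gb2312 chinese-gbk)
  "Charsets that receive the custom break-point function (illustrative).")

(defun my-chinese-word-set-break-function (function)
  "Install FUNCTION as the fill break-point function on each charset.
Passing nil removes it again, as thai-word-mode does when disabled."
  (dolist (charset my-chinese-word-simplified-charsets)
    (put-charset-property charset 'fill-find-break-point-function
                          function)))
```

The mode body would then call this once with the break-point function
when enabling and once with nil when disabling.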

Assuming I fix this problem and other inevitable bugs, would this
library be of general interest to Emacs? The dictionary comes from the
(relatively authoritative) CC-CEDICT project[1], which is licensed under
the Creative Commons Attribution-Share Alike 3.0 License. I've lopped
off some non-applicable dictionary entries, and everything over four
characters long, since those are usually compound phrasal entries.

Eric

[1] http://cc-cedict.org/wiki/


* Re: chinese word mode
  2013-11-06  6:59 ` Eric Abrahamsen
@ 2013-11-06 13:36   ` Stefan Monnier
  2013-11-07 12:15     ` Kenichi Handa
  2013-11-06 15:37   ` William Xu
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Monnier @ 2013-11-06 13:36 UTC
  To: Kenichi Handa; +Cc: Eric Abrahamsen, emacs-devel

> Assuming I fix this problem and other inevitable bugs, would this
> library be of general interest to Emacs? The dictionary comes from the

Handa?  Any comment on this suggestion?


        Stefan




* Re: chinese word mode
  2013-11-06  6:59 ` Eric Abrahamsen
  2013-11-06 13:36   ` Stefan Monnier
@ 2013-11-06 15:37   ` William Xu
  2013-11-07  7:13     ` Eric Abrahamsen
  1 sibling, 1 reply; 9+ messages in thread
From: William Xu @ 2013-11-06 15:37 UTC
  To: emacs-devel

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
> [...]
>
>> (define-minor-mode thai-word-mode
>>   :global t :group 'mule
>>   (cond (thai-word-mode
>> 	 ;; This enables linebreak between Thai characters.
>> 	 (modify-category-entry (make-char 'thai-tis620) ?|)
>> 	 ;; This enables linebreak at a Thai word boundary.
>> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
>> 			       'thai-fill-find-break-point))
>> 	(t
>> 	 (modify-category-entry (make-char 'thai-tis620) ?| nil t)
>> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
>> 			       nil))))
>>
>
> [...]
>
>> My buffers are utf-8 encoded, and describe-char on a Chinese character
>> shows "preferred charset: unicode-bmp". So what do I put for the charset
>> in order to make these functions target the right characters? Chinese
>> characters all seem to have the "|" line-breakable category by default,
>> but (I think) I can only add the custom fill break point function one
>> charset at a time.
>
> I've tried slapping the 'fill-find-break-point-function onto the
> 'unicode charset for now, and it works fine because the function only
> does anything if point is in the midst of Chinese. It presumably gets
> applied to all characters, though, and that can't be a real solution.

modify-category-entry also accepts a range cons, where you can select
Chinese characters by range.  For example,

     (#x3400 . #x4DBF)                    ; CJK Unified Ideographs Extension A
     (#x4E00 . #x9FFF)                    ; CJK Unified Ideographs
     (#xF900 . #xFAFF)                    ; CJK Compatibility Ideographs

put-charset-property seems to accept only a charset.
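
The range form can be sketched as follows. The variable and function
names are hypothetical; only modify-category-entry and the three ranges
come from the discussion. Note that with a nil TABLE argument the call
edits the current buffer's category table, which is normally shared, so
removing "|" this way affects more than one buffer.

```elisp
(defvar my-chinese-word-ranges
  '((#x3400 . #x4DBF)   ; CJK Unified Ideographs Extension A
    (#x4E00 . #x9FFF)   ; CJK Unified Ideographs
    (#xF900 . #xFAFF))  ; CJK Compatibility Ideographs
  "Character ranges to treat as Chinese (hypothetical variable).")

(defun my-chinese-word-set-breakable (enable)
  "Add the `|' category to each range if ENABLE is non-nil, else remove it.
This modifies the current buffer's category table."
  (dolist (range my-chinese-word-ranges)
    ;; Fourth argument RESET non-nil means remove the category.
    (modify-category-entry range ?| nil (not enable))))
```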

> I'm guessing I'll need to separate simplified and traditional word sets
> and make two versions of the mode. Both modes will loop through their
> applicable charsets and apply/remove the custom break point function.
>
> Assuming I fix this problem and other inevitable bugs, would this
> library be of general interest to Emacs?

It can make those word movement functions useful.  :)

-- 
William

http://xwl.appspot.com





* Re: chinese word mode
  2013-11-06 15:37   ` William Xu
@ 2013-11-07  7:13     ` Eric Abrahamsen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Abrahamsen @ 2013-11-07  7:13 UTC
  To: emacs-devel

William Xu <william.xwl@gmail.com> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>>
>> [...]
>>
>>> (define-minor-mode thai-word-mode
>>>   :global t :group 'mule
>>>   (cond (thai-word-mode
>>> 	 ;; This enables linebreak between Thai characters.
>>> 	 (modify-category-entry (make-char 'thai-tis620) ?|)
>>> 	 ;; This enables linebreak at a Thai word boundary.
>>> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
>>> 			       'thai-fill-find-break-point))
>>> 	(t
>>> 	 (modify-category-entry (make-char 'thai-tis620) ?| nil t)
>>> 	 (put-charset-property 'thai-tis620 'fill-find-break-point-function
>>> 			       nil))))
>>>
>>
>> [...]
>>
>>> My buffers are utf-8 encoded, and describe-char on a Chinese character
>>> shows "preferred charset: unicode-bmp". So what do I put for the charset
>>> in order to make these functions target the right characters? Chinese
>>> characters all seem to have the "|" line-breakable category by default,
>>> but (I think) I can only add the custom fill break point function one
>>> charset at a time.
>>
>> I've tried slapping the 'fill-find-break-point-function onto the
>> 'unicode charset for now, and it works fine because the function only
>> does anything if point is in the midst of Chinese. It presumably gets
>> applied to all characters, though, and that can't be a real solution.
>
> modify-category-entry also accepts a range cons, where you can select
> Chinese characters by range.  For example,
>
>      (#x3400 . #x4DBF)                    ; CJK Unified Ideographs Extension A
>      (#x4E00 . #x9FFF)                    ; CJK Unified Ideographs
>      (#xF900 . #xFAFF)                    ; CJK Compatibility Ideographs
>
> put-charset-property seems to accept only a charset.
>
>> I'm guessing I'll need to separate simplified and traditional word sets
>> and make two versions of the mode. Both modes will loop through their
>> applicable charsets and apply/remove the custom break point function.
>>
>> Assuming I fix this problem and other inevitable bugs, would this
>> library be of general interest to Emacs?
>
> It can make those word movement functions useful.  :)

That's certainly the idea! I'll admit I was motivated to do this by
using LibreOffice, which I usually can't stand, and noticing it DTRT
with Chinese words. A bit of Emacs chauvinism kicked in...

Thanks for the tips on categories and all. I don't think I need the
modify-category-entry section at all, since Chinese characters have the
"|" category by default. So it's just looping on applicable charsets.

E





* Re: chinese word mode
  2013-11-06 13:36   ` Stefan Monnier
@ 2013-11-07 12:15     ` Kenichi Handa
  2013-11-08  3:36       ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Kenichi Handa @ 2013-11-07 12:15 UTC
  To: Stefan Monnier; +Cc: eric, emacs-devel

In article <jwv4n7pbqlt.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Assuming I fix this problem and other inevitable bugs, would this
> > library be of general interest to Emacs? The dictionary comes from the

> Handa?  Any comment on this suggestion?

I agree that such a feature is useful for Chinese users.
But I have one question.

> The idea is that an entire dictionary of words is provided in a nested
> char table, and then a minor mode both remaps most word-related commands
> to use that dictionary, and fill-find-break-point-function is rewired to
> do the same.

I understand that such commands as M-f and M-d will get more
convenient on Chinese text, but I don't understand the
latter part; i.e. the need for working on
fill-find-break-point-function.  As far as I know, Chinese
text (as well as Japanese text) can be broken at any point
except for "kinsoku" processing.  So there's no need to
change the current behavior as to line-breaking.  Am I
missing something?

---
Kenichi Handa
handa@gnu.org


* Re: chinese word mode
  2013-11-07 12:15     ` Kenichi Handa
@ 2013-11-08  3:36       ` Eric Abrahamsen
  2013-11-08 23:03         ` Xue Fuqiao
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Abrahamsen @ 2013-11-08  3:36 UTC
  To: emacs-devel; +Cc: Stefan Monnier

Kenichi Handa <handa@gnu.org> writes:

> In article <jwv4n7pbqlt.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>> > Assuming I fix this problem and other inevitable bugs, would this
>> > library be of general interest to Emacs? The dictionary comes from the
>
>> Handa?  Any comment on this suggestion?
>
> I agree that such a feature is useful for Chinese users.
> But I have one question.
>
>> The idea is that an entire dictionary of words is provided in a nested
>> char table, and then a minor mode both remaps most word-related commands
>> to use that dictionary, and fill-find-break-point-function is rewired to
>> do the same.
>
> I understand that such commands as M-f and M-d will get more
> convenient on Chinese text, but I don't understand the
> latter part; i.e. the need for working on
> fill-find-break-point-function.  As far as I know, Chinese
> text (as well as Japanese text) can be broken at any point
> except for "kinsoku" processing.  So there's no need to
> change the current behavior as to line-breaking.  Am I
> missing something?

Huh, interesting -- I'd been thinking entirely in terms of making
Chinese editing easier on the eyes, rather than Chinese typographical
conventions. But you're right, breaking words is perfectly okay.

I can add a chinese-word-enable-kinsoku option, and then add the "<" and
">" categories to the characters that need them.

Eric





* Re: chinese word mode
  2013-11-08  3:36       ` Eric Abrahamsen
@ 2013-11-08 23:03         ` Xue Fuqiao
  2013-11-09  2:51           ` Eric Abrahamsen
  0 siblings, 1 reply; 9+ messages in thread
From: Xue Fuqiao @ 2013-11-08 23:03 UTC
  To: Eric Abrahamsen; +Cc: emacs-devel

Looks interesting, although sometimes it can be ambiguous.

For example:

* `化妆和服装' can split into either `化妆 和 服装' or `化妆 和服 装';
* In `这个门把手坏了', `把手' is a word, but in `请把手拿开', `把手' is not a word;
* In `将军任命了一名中将', `中将' is a word, but in `产量三年中将增长两倍', `中将' isn't a
word any more.

How do you solve this problem?


* Re: chinese word mode
  2013-11-08 23:03         ` Xue Fuqiao
@ 2013-11-09  2:51           ` Eric Abrahamsen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Abrahamsen @ 2013-11-09  2:51 UTC
  To: emacs-devel

Xue Fuqiao <xfq.free@gmail.com> writes:

> Looks interesting, although sometimes it can be ambiguous.
>
> For example:
>
> * `化妆和服装' can split into either `化妆 和 服装' or `化妆 和服 装';
> * In `这个门把手坏了', `把手' is a word, but in `请把手拿开', `把手' is not a word;
> * In `将军任命了一名中将', `中将' is a word, but in `产量三年中将增长两倍', `中将' isn't a
> word any more.
>
> How do you solve this problem?

Short answer: you don't! When I first started looking at this issue, I
was considering all kinds of complicated solutions involving external
libraries by people smarter than me, who presumably had a system for
syntactical analysis.

Then someone pointed me at thai-word.el, which takes the "dumb"
approach -- scanning forward for the longest string in a word list --
and I realized if I was going to get anything actually completed, this
would have to do.

Mainly the point is making navigation through Chinese prose a little
less annoying, not producing a "correct" solution. As you note, there
will always be ambiguities that can't be resolved.
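
To make the "dumb" approach concrete, here is a rough sketch of greedy
longest-match scanning. It is not the thai-word.el code: it swaps the
nested char table for a plain hash table of strings, the function names
are hypothetical, and the default cap of four characters simply mirrors
the dictionary trimming mentioned earlier in the thread.

```elisp
(defun my-longest-word-end (string pos table max-len)
  "Return the end index of the longest TABLE word at POS in STRING, or nil.
Candidates are tried from MAX-LEN characters down to one."
  (let ((len (min max-len (- (length string) pos)))
        end)
    (while (and (not end) (> len 0))
      (if (gethash (substring string pos (+ pos len)) table)
          (setq end (+ pos len))
        (setq len (1- len))))
    end))

(defun my-segment (string words &optional max-len)
  "Split STRING greedily into the longest entries found in WORDS.
Characters that start no known word become one-character segments."
  (let ((table (make-hash-table :test #'equal))
        (limit (or max-len 4))
        (pos 0)
        result)
    (dolist (word words)
      (puthash word t table))
    (while (< pos (length string))
      (let ((end (or (my-longest-word-end string pos table limit)
                     (1+ pos))))
        (push (substring string pos end) result)
        (setq pos end)))
    (nreverse result)))

(my-segment "化妆和服装" '("化妆" "和服" "服装" "和"))
;; => ("化妆" "和服" "装")
```

The result on the first example from the question shows the ambiguity
directly: the greedy scan commits to `和服' and leaves `装' stranded,
with no way to prefer `化妆 和 服装'.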

Eric

