word boundaries in Asian languages

unofficial mirror of help-gnu-emacs@gnu.org

* word boundaries in Asian languages
@ 2013-08-19 10:26 Eric Abrahamsen
  2013-08-19 16:23 ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Abrahamsen @ 2013-08-19 10:26 UTC (permalink / raw)
  To: help-gnu-emacs

I use emacs for prose more than for programming, and I've been idly
fiddling with making it a better environment for editing other
languages, particularly Asian languages, particularly Chinese prose.

One of the really awkward things about editing Chinese prose in Emacs is
that word boundaries are bound to spaces -- in a language that doesn't
use spaces to delineate words, movement and editing commands are thus
restricted either to per-character, or per-punctuated-phrase. It's
unwieldy.

Accurately identifying word boundaries in Chinese is a subject of
academic research, but a couple of C libraries have emerged (I've pasted
a couple of likely links at the bottom).

Given that this level of programming is _way_ above my pay grade, I
raise the following totally hypothetical scenario. How likely is this:

1. I call "forward-word" (or some equivalent word-based command)
2. Emacs checks a variable like use-multilingual-words, or something like
   that, which makes all the following optional.
3. It's true, so we check the script of the following character, and try
   a lookup in a variable that pairs scripts with C libraries that
   provide word-level commands for those scripts.
4. A library is present! Instead of the usual "forward-word", we now
   call a function from that library to identify the next word boundary.
   Point goes either to that spot, or to the end of a contiguous run of
   characters of the same script that we started in.
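
The four steps above can be sketched outside Emacs, in Python, just to make
the control flow concrete. Everything named here is hypothetical:
`SEGMENTERS` stands in for the variable pairing scripts with libraries,
`script_of` is a crude guess based on Unicode character names rather than a
real script database, and `forward_word` only imitates the command of the
same name. It is an illustration of the idea, not how Emacs works.

```python
import unicodedata

# Hypothetical registry: script name -> function(text, pos, limit) -> boundary.
# In the scenario above this would point at external segmentation libraries.
SEGMENTERS = {}


def script_of(ch):
    """Crude script guess from the Unicode name (an assumption for this sketch)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if name.startswith("THAI"):
        return "thai"
    if ch.isalpha():
        return "other-letter"
    return None  # whitespace, punctuation, digits, ...


def end_of_script_run(text, pos):
    """Index just past the contiguous run of characters sharing text[pos]'s script."""
    script = script_of(text[pos])
    end = pos
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return end


def forward_word(text, pos, use_multilingual_words=True):
    """Return the position after the next word, starting the search at pos."""
    # Skip anything that is not part of a word.
    while pos < len(text) and script_of(text[pos]) is None:
        pos += 1
    if pos >= len(text):
        return len(text)
    run_end = end_of_script_run(text, pos)
    segmenter = None
    if use_multilingual_words:  # step 2: the whole mechanism is optional
        segmenter = SEGMENTERS.get(script_of(text[pos]))  # step 3: lookup by script
    if segmenter is None:
        return run_end  # fallback: end of the same-script run
    boundary = segmenter(text, pos, run_end)  # step 4: ask the library
    # Never move backwards and never leave the current script run.
    return max(pos + 1, min(boundary, run_end))
```

With no segmenter registered for a script, point simply lands at the end of
the same-script run, which is the fallback described in step 4.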

So external C libraries would have to be augmented with functions that
did word boundary location in a way that made sense to emacs, but
presumably the hard work would have already been done. Given my general
ignorance, how unlikely is all of this?

Thanks!
Eric

http://technology.chtsai.org/mmseg/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8593


* Re: word boundaries in Asian languages
  2013-08-19 10:26 word boundaries in Asian languages Eric Abrahamsen
@ 2013-08-19 16:23 ` Eli Zaretskii
  2013-08-19 17:22   ` Thien-Thi Nguyen
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2013-08-19 16:23 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Mon, 19 Aug 2013 18:26:20 +0800
> 
> I use emacs for prose more than for programming, and I've been idly
> fiddling with making it a better environment for editing other
> languages, particularly Asian languages, particularly Chinese prose.
> 
> One of the really awkward things about editing Chinese prose in Emacs is
> that word boundaries are bound to spaces -- in a language that doesn't
> use spaces to delineate words, movement and editing commands are thus
> restricted either to per-character, or per-punctuated-phrase. It's
> unwieldy.
> 
> Accurately identifying word boundaries in Chinese is a subject of
> academic research, but a couple of C libraries have emerged (I've pasted
> a couple of likely links at the bottom).
> 
> Given that this level of programming is _way_ above my pay grade, I
> raise the following totally hypothetical scenario. How likely is this:

The right place to discuss this is emacs-devel, not here.

> 1. I call "forward-word" (or some equivalent word-based command)
> 2. Emacs checks a variable like use-multilingual-words, or something like
>    that, which makes all the following optional.
> 3. It's true, so we check the script of the following character, and try
>    a lookup in a variable that pairs scripts with C libraries that
>    provide word-level commands for those scripts.
> 4. A library is present! Instead of the usual "forward-word", we now
>    call a function from that library to identify the next word boundary.
>    Point goes either to that spot, or to the end of a contiguous run of
>    characters of the same script that we started in.
> 
> So external C libraries would have to be augmented with functions that
> did word boundary location in a way that made sense to emacs, but
> presumably the hard work would have already been done. Given my general
> ignorance, how unlikely is all of this?

A couple of comments:

 . we already have large dictionaries in Emacs, the ones LEIM input
   methods use; perhaps they could serve double duty for matching text
   to words

 . there's also UAX#29 (http://www.unicode.org/reports/tr29/)
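
To illustrate the first comment, here is a minimal sketch of matching text
against a word list using plain forward maximum matching: at each position,
take the longest dictionary entry that fits, else a single character. This
is a much simpler technique than what MMSEG or UAX#29 specify, and the
lexicon below is a toy set invented for the example, not data from the LEIM
dictionaries. The names `next_word_end` and `segment` are hypothetical.

```python
def next_word_end(text, pos, lexicon, max_len=4):
    """End index of the longest lexicon word starting at pos.

    Falls back to a single character when no multi-character entry matches.
    """
    limit = min(len(text), pos + max_len)
    # Try the longest candidate first, down to two characters.
    for end in range(limit, pos + 1, -1):
        if text[pos:end] in lexicon:
            return end
    return min(pos + 1, len(text))


def segment(text, lexicon, max_len=4):
    """Split text into words by repeatedly taking the longest match."""
    words, pos = [], 0
    while pos < len(text):
        end = next_word_end(text, pos, lexicon, max_len)
        words.append(text[pos:end])
        pos = end
    return words


# Toy lexicon, assumed for illustration only.
toy_lexicon = {"中文", "编辑", "编辑器"}
words = segment("中文编辑器", toy_lexicon)
```

A word-motion command would only need `next_word_end`; `segment` is there to
show the boundaries for a whole string. Greedy matching picks wrong
boundaries in ambiguous cases, which is part of why the problem remains a
research topic.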




* Re: word boundaries in Asian languages
  2013-08-19 16:23 ` Eli Zaretskii
@ 2013-08-19 17:22   ` Thien-Thi Nguyen
  2013-08-20  1:11     ` Eric Abrahamsen
  0 siblings, 1 reply; 5+ messages in thread
From: Thien-Thi Nguyen @ 2013-08-19 17:22 UTC (permalink / raw)
  To: help-gnu-emacs


() Eli Zaretskii <eliz@gnu.org>
() Mon, 19 Aug 2013 19:23:04 +0300

   The right place to discuss this is emacs-devel, not here.

Emacs provides var ‘find-word-boundary-function-table’, used in
Capitalized Words Mode (see lisp/progmodes/cap-words.el).  There are
several "subword" modes, Thai Word Mode (lisp/language/thai-util.el),
etc.

Perhaps some ideas and techniques there can be used in this context by
normal users (with adventurous spirit :-D) here, as well.

-- 
Thien-Thi Nguyen
   GPG key: 4C807502
   (if you're human and you know it)
      read my lisp: (responsep (questions 'technical)
                               (not (via 'mailing-list)))
                     => nil



* Re: word boundaries in Asian languages
  2013-08-19 17:22   ` Thien-Thi Nguyen
@ 2013-08-20  1:11     ` Eric Abrahamsen
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Abrahamsen @ 2013-08-20  1:11 UTC (permalink / raw)
  To: help-gnu-emacs

Thien-Thi Nguyen <ttn@gnu.org> writes:

> () Eli Zaretskii <eliz@gnu.org>
> () Mon, 19 Aug 2013 19:23:04 +0300
>
>    The right place to discuss this is emacs-devel, not here.
>
> Emacs provides var ‘find-word-boundary-function-table’, used in
> Capitalized Words Mode (see lisp/progmodes/cap-words.el).  There are
> several "subword" modes, Thai Word Mode (lisp/language/thai-util.el),
> etc.
>
> Perhaps some ideas and techniques there can be used in this context by
> normal users (with adventurous spirit :-D) here, as well.

Whoa, no kidding. The language support for Thai does exactly what I was
thinking of (though in elisp, not C), and it does it by loading the
entire Thai language into memory!

I've already got all of Chinese loaded into memory for the wubi input
method, so theoretically something could be done there...

If I manage to do anything with this, I'll post to emacs-devel.

Thanks!
Eric



* Re: word boundaries in Asian languages
       [not found] <mailman.330.1376907966.10748.help-gnu-emacs@gnu.org>
@ 2013-08-25 13:51 ` Stefan Monnier
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Monnier @ 2013-08-25 13:51 UTC (permalink / raw)
  To: help-gnu-emacs

> Accurately identifying word boundaries in Chinese is a subject of
> academic research, but a couple of C libraries have emerged (I've pasted
> a couple of likely links at the bottom).

Note also that Emacs already uses some notion of "boundary" for Asian
scripts in its text-filling code (used in fill-paragraph).
I'm sure if you post on emacs-devel you may learn even more and, who
knows, someone may already have done such a thing.

        Stefan

